Module 3

Dimensionality Reduction and Clustering

Author

Yike Zhang

Published

September 18, 2026

Class Activities

Week 6

Recap

Link to Google Form

Examples

Example 1: Feature Representation and Scaling

We will use Min-Max scaling and one-hot encoding from scikit-learn to scale the numeric features and encode the categorical feature of the dataset below.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Example dataset
data = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [50000, 64000, 120000, 110000, 150000],
    "gender": ["M", "F", "F", "M", "M"]
})

print(data)

# Define numeric and categorical features
numeric_features = ["age", "income"]
categorical_features = ["gender"]

# Set the Min-Max scaler for numeric features
numeric_transformer = Pipeline(steps=[("scaler", MinMaxScaler())])

# Drop the first category to avoid redundancy
categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(drop="first"))
])

# Combine transformations
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Apply transformation
X_scaled = preprocessor.fit_transform(data)

print("\nTransformed Features:")
print(X_scaled)
   age  income gender
0   25   50000      M
1   32   64000      F
2   47  120000      F
3   51  110000      M
4   62  150000      M

Transformed Features:
[[0.         0.         1.        ]
 [0.18918919 0.14       0.        ]
 [0.59459459 0.7        0.        ]
 [0.7027027  0.6        1.        ]
 [1.         1.         1.        ]]
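
Min-Max scaling maps each numeric column to the range [0, 1] using x' = (x - min) / (max - min). As a check on the output above, the following minimal sketch recomputes the scaled `age` and `income` columns by hand with NumPy. The helper name `min_max_scale` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Same numbers as the "age" and "income" columns of the example dataset
age_scaled = min_max_scale([25, 32, 47, 51, 62])
income_scaled = min_max_scale([50000, 64000, 120000, 110000, 150000])

print(np.round(age_scaled, 8))
print(income_scaled)
```

For instance, the second age value becomes (32 - 25) / (62 - 25) = 7 / 37, which matches the 0.18918919 shown in the transformed features.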

Example 2: Use PCA to Extract the Secret Message

We plot the first two principal components of a dataset stored in a CSV file to reveal a hidden message. The data file can be downloaded here. The dataset has 491 rows and 10 columns in total.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("DimensionReduction.csv")

# Project the data onto the first two principal components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df.values)

# Make a 2D scatter plot
plt.figure(figsize=(8,6))
plt.scatter(pca_result[:,0], pca_result[:,1], alpha=1)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.axis("equal")
plt.grid(True)
plt.show()
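
To make the projection step less of a black box, here is a rough sketch of what `PCA.fit_transform` computes: it centers each column (it does not rescale it) and projects the data onto the directions of largest variance. The sketch uses synthetic data as a stand-in, since the contents of the CSV file are not reproduced here; the array `X` and its column scales are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 200 samples, 5 features with
# clearly different variances so the component order is unambiguous
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# scikit-learn: project onto the first two principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# By hand: center the columns, then use the top two right singular vectors
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores_manual = X_centered @ Vt[:2].T

# The two projections should agree up to the sign of each component
print(np.allclose(np.abs(scores), np.abs(scores_manual), atol=1e-6))

# Fraction of the total variance captured by each kept component
print(pca.explained_variance_ratio_)
```

Because PCA does not rescale the columns, features with large numeric ranges dominate the leading components unless the data is standardized first.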

Hands-on Practice

We will use the simplified Housing dataset to practice feature scaling. The simplified dataset has 5 rows and 6 columns in total.
import pandas as pd
import numpy as np

# Load the original Housing dataset
original_df = pd.read_csv('housing.csv')
# Use only the first 5 rows for simplicity
simplified_df = original_df.iloc[:5, :]

# Keep only numeric columns for scaling
simplified_df = simplified_df.select_dtypes(include=np.number)
simplified_df
   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price
0       79545.45857             5.682861                   7.009188                          4.09      23086.80050  1.059034e+06
1       79248.64245             6.002900                   6.730821                          3.09      40173.07217  1.505891e+06
2       61287.06718             5.865890                   8.512727                          5.13      36882.15940  1.058988e+06
3       63345.24005             7.188236                   5.586729                          3.26      34310.24283  1.260617e+06
4       59982.19723             5.040555                   7.839388                          4.23      26354.10947  6.309435e+05

Q1: Use Min-Max scaling to scale the features in the dataset, and display the scaled dataset. Note that the scaled values should look like the following:

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population     Price
0          1.000000             0.299070                   0.486145                      0.490196         0.000000  0.489275
1          0.984828             0.448086                   0.391009                      0.000000         1.000000  1.000000
2          0.066700             0.384291                   1.000000                      1.000000         0.807394  0.489223
3          0.171906             1.000000                   0.000000                      0.083333         0.656869  0.719670
4          0.000000             0.000000                   0.769877                      0.558824         0.191224  0.000000
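
`MinMaxScaler.fit_transform` returns a NumPy array, so the column names are lost unless the result is wrapped back into a DataFrame. One possible pattern is sketched below on two columns copied from the five rows printed earlier; the variable names `subset` and `scaled` are illustrative assumptions, not part of the exercise.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Two columns copied from the simplified dataset shown earlier
# (assumption: a subset is enough to illustrate the pattern)
subset = pd.DataFrame({
    "Avg. Area House Age": [5.682861, 6.002900, 5.865890, 7.188236, 5.040555],
    "Avg. Area Number of Bedrooms": [4.09, 3.09, 5.13, 3.26, 4.23],
})

# Scale each column to [0, 1], then restore the column names and index
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(subset),
    columns=subset.columns,
    index=subset.index,
)

print(scaled)
```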

Q2: Use standardization (z-score scaling) to scale the features in the dataset, and display the scaled dataset. Note that the scaled values should look like the following:

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population      Price
0          1.232415            -0.391014                  -0.126956                      0.176724        -1.409777  -0.153146
1          1.198743             0.066992                  -0.406144                     -1.182694         1.244683   1.400031
2         -0.838872            -0.129083                   1.381019                      1.590520         0.733419  -0.153305
3         -0.605386             1.763319                  -1.553612                     -0.951593         0.333855   0.547513
4         -0.986900            -1.310215                   0.705693                      0.367043        -0.902180  -1.641093
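
Standardization computes z = (x - mean) / std for each column. The expected values above are consistent with the population standard deviation (`ddof=0`), which is what scikit-learn's `StandardScaler` uses, whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below illustrates the difference on the bedrooms column from the five rows printed earlier; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Bedrooms column copied from the simplified dataset shown earlier
bedrooms = pd.Series([4.09, 3.09, 5.13, 3.26, 4.23],
                     name="Avg. Area Number of Bedrooms")

# Population standard deviation (ddof=0), as used by StandardScaler
z_population = (bedrooms - bedrooms.mean()) / bedrooms.std(ddof=0)

# pandas default is the sample standard deviation (ddof=1)
z_sample = (bedrooms - bedrooms.mean()) / bedrooms.std()

print(z_population.round(6).tolist())
print(z_sample.round(6).tolist())
```

With only five rows the two conventions differ noticeably, so a manual pandas computation needs `ddof=0` to reproduce the table above.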

Q3: First, scale the original Housing dataset using standardization (z-score scaling). Then define the column “Price” as the target variable and treat all other columns as the features. Print the head (the first 5 rows) of the dataset that contains only the features. Finally, use PCA to reduce the feature dataset to two dimensions and plot the result.

import pandas as pd
import numpy as np
original_df = pd.read_csv('housing.csv')
original_df = original_df.select_dtypes(include=np.number)

Print out the head of the dataset that contains only the features after standardization:

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population
0          1.028660            -0.296927                   0.021274                      0.088062        -1.317599
1          1.000808             0.025902                  -0.255506                     -0.722301         0.403999
2         -0.684629            -0.112303                   1.516243                      0.930840         0.072410
3         -0.491499             1.221572                  -1.393077                     -0.584540        -0.186734
4         -0.807073            -0.944834                   0.846742                      0.201513        -0.988387

The resulting 2D PCA plot should look like the following:

Q4: Train a linear regression model on the PCA-reduced features of the standardized dataset from Q3 (features and target variable). Use an 80%/20% train/test split. Report the MSE score for model evaluation at the end.

MSE result: 0.856107037084032
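
The overall workflow of splitting, reducing with PCA, fitting, and scoring can be sketched as follows. This is a minimal illustration on synthetic data, not the reference solution: the `make_regression` data, `random_state=42`, and the choice of two components are all assumptions, so the printed MSE will not match the value above.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target (assumption)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# 80% / 20% train-test split; random_state is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize, reduce to two principal components, then fit a linear model
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X_train, y_train)

# Evaluate on the held-out 20%
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {mse:.3f}")
```

Unlike Q3, which scales the whole dataset up front, this sketch fits the scaler and PCA on the training split only, a design choice that keeps test-set statistics out of the fitted transformations. It also leaves the target unscaled, so its MSE is in the target's original units.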