Module 3

Dimensionality Reduction and Clustering

Author

Yike Zhang

Published

September 18, 2025

Class Activities

Week 6

Recap

Link to Google Form

Examples

Example 1: Feature Representation and Scaling

We will use Min-Max scaling and One-Hot encoding from scikit-learn to scale and encode the features of the data provided below.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Example dataset
data = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [50000, 64000, 120000, 110000, 150000],
    "gender": ["M", "F", "F", "M", "M"]
})

print(data)

# Define numeric and categorical features
numeric_features = ["age", "income"]
categorical_features = ["gender"]

# Set the Min-Max scaler for numeric features
numeric_transformer = Pipeline(steps=[("scaler", MinMaxScaler())])

# Drop the first category to avoid redundancy
categorical_transformer = Pipeline(steps=[
        ("onehot", OneHotEncoder(drop="first"))])

# Combine transformations
preprocessor = ColumnTransformer(
        transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)])

# Apply transformation
X_scaled = preprocessor.fit_transform(data)

print("\nTransformed Features:")
print(X_scaled)
   age  income gender
0   25   50000      M
1   32   64000      F
2   47  120000      F
3   51  110000      M
4   62  150000      M

Transformed Features:
[[0.         0.         1.        ]
 [0.18918919 0.14       0.        ]
 [0.59459459 0.7        0.        ]
 [0.7027027  0.6        1.        ]
 [1.         1.         1.        ]]
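
The transformed array drops the original column labels. If you are unsure which output column is which, the fitted preprocessor can report its output feature names in recent scikit-learn versions (1.0+); a minimal sketch:

# Recover output column names from the fitted ColumnTransformer
# (requires a recent scikit-learn; names below are the expected pattern)
print(preprocessor.get_feature_names_out())
# e.g. ['num__age' 'num__income' 'cat__gender_M']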

Example 2: Use PCA to Extract the Secret Message

We plot the first two PCA dimensions to extract a hidden message from a CSV file. The data file can be downloaded here. The dataset has 491 rows and 10 columns in total.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

df = pd.read_csv("DimensionReduction.csv")

# Plot the first two PCA dimensions
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df.values)

# Make a 2D scatter plot
plt.figure(figsize=(8,6))
plt.scatter(pca_result[:,0], pca_result[:,1], alpha=1)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.axis("equal")
plt.grid(True)
plt.show()
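
If you want to check how much of the total variance the two components retain, the fitted PCA object exposes this directly; a short sketch:

# Fraction of variance explained by each of the two components
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")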

Hands-on Practice

We will use the simplified Housing dataset to practice feature scaling. The simplified dataset has 5 rows and 6 columns in total.
import pandas as pd
import numpy as np

# Load the original Housing dataset
original_df = pd.read_csv('housing.csv')
# Use only the first 5 rows for simplicity
simplified_df = original_df.iloc[:5, :]

# Keep only numeric columns for scaling
simplified_df = simplified_df.select_dtypes(include=np.number)
simplified_df
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
0 79545.45857 5.682861 7.009188 4.09 23086.80050 1.059034e+06
1 79248.64245 6.002900 6.730821 3.09 40173.07217 1.505891e+06
2 61287.06718 5.865890 8.512727 5.13 36882.15940 1.058988e+06
3 63345.24005 7.188236 5.586729 3.26 34310.24283 1.260617e+06
4 59982.19723 5.040555 7.839388 4.23 26354.10947 6.309435e+05

Q1: Use Min-Max scaling to scale the features in the dataset, and display the scaled dataset. Note that the scaled values should look like the following:

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
0 1.000000 0.299070 0.486145 0.490196 0.000000 0.489275
1 0.984828 0.448086 0.391009 0.000000 1.000000 1.000000
2 0.066700 0.384291 1.000000 1.000000 0.807394 0.489223
3 0.171906 1.000000 0.000000 0.083333 0.656869 0.719670
4 0.000000 0.000000 0.769877 0.558824 0.191224 0.000000
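
One possible sketch for Q1, assuming simplified_df from the cell above is still in scope:

from sklearn.preprocessing import MinMaxScaler

# Scale every column to the [0, 1] range and keep the original labels
scaler = MinMaxScaler()
minmax_df = pd.DataFrame(scaler.fit_transform(simplified_df),
                         columns=simplified_df.columns)
print(minmax_df)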

Q2: Use Standardization scaling to scale the features in the dataset, and display the scaled dataset. Note that the scaled values should look like the following:

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
0 1.232415 -0.391014 -0.126956 0.176724 -1.409777 -0.153146
1 1.198743 0.066992 -0.406144 -1.182694 1.244683 1.400031
2 -0.838872 -0.129083 1.381019 1.590520 0.733419 -0.153305
3 -0.605386 1.763319 -1.553612 -0.951593 0.333855 0.547513
4 -0.986900 -1.310215 0.705693 0.367043 -0.902180 -1.641093
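
A matching sketch for Q2, again assuming simplified_df is in scope (note that StandardScaler uses the population standard deviation):

from sklearn.preprocessing import StandardScaler

# Center each column at zero mean and scale to unit variance
scaler = StandardScaler()
standardized_df = pd.DataFrame(scaler.fit_transform(simplified_df),
                               columns=simplified_df.columns)
print(standardized_df)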

Q3: First, we will scale the original Housing dataset using the Standardization (Z-score) scaling technique. After that, we define the column “Price” as the target variable, while all other columns serve as the features. Please print out the head of the dataset that only contains features (the first 5 rows). We will then use PCA to reduce the dimensionality of the feature dataset to two dimensions and plot it.

import pandas as pd
import numpy as np
original_df = pd.read_csv('housing.csv')
original_df = original_df.select_dtypes(include=np.number)

Print out the head of the dataset that only contains features after Standardization scaling:

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population
0 1.028660 -0.296927 0.021274 0.088062 -1.317599
1 1.000808 0.025902 -0.255506 -0.722301 0.403999
2 -0.684629 -0.112303 1.516243 0.930840 0.072410
3 -0.491499 1.221572 -1.393077 -0.584540 -0.186734
4 -0.807073 -0.944834 0.846742 0.201513 -0.988387

The resulting 2D PCA plot should look like the following:
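
One way to approach Q3, continuing from the loading snippet above (a sketch, not the only solution):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Standardize all numeric columns, then split features from the target
scaled = pd.DataFrame(StandardScaler().fit_transform(original_df),
                      columns=original_df.columns)
X = scaled.drop(columns=["Price"])
y = scaled["Price"]
print(X.head())

# Reduce the five features to two principal components and plot them
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()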

Q4: Use PCA to train a linear regression model on the standardized dataset from Q3 that contains the features and the target variable. The train and test split ratio is 80% and 20%, respectively. Report the MSE score for model evaluation at the end.

MSE result: 0.856107037084032
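
A sketch for Q4, reusing X_pca and y from the Q3 sketch above; the exact MSE depends on how the split is seeded (random_state=42 is an assumption here):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 80/20 train/test split on the two PCA components
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test)))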

Week 7

Recap

Link to Google Form

Examples

Example 1: Interactive K-Means Clustering Plot

Below is an interactive k-means clustering visualization with animated iterations and a Voronoi overlay. You can adjust the number of clusters using the slider. The data points are loaded from a CSV file hosted on GitHub, and the centroids are randomly initialized. The visualization updates the clusters and centroids over 10 iterations.

You pick the number of clusters k, then the plot:

  • randomly seeds k centroids,
  • repeatedly assigns each point to the nearest centroid,
  • updates each centroid to the mean of its assigned points,
  • animates those changes over ~10 iterations,
  • shows a Voronoi partition (the colored regions), the data points, and the centroid markers.

Example 2: Implementing K-Means Clustering from Scratch

In this example, we implement K-Means Clustering from scratch using Python. The dataset we use can be downloaded here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Fix seeds so the randomly chosen starting centroids are reproducible
random.seed(42)
np.random.seed(42)

data = pd.read_csv('clustering.csv')
data.head()

X = data[["LoanAmount","ApplicantIncome"]].copy()
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

# Step 1 - Choose the number of clusters (k) 
# Number of clusters
K = 3

# Step 2 - Select random centroid for each cluster
# Select random observation as centroids
CTs = X.sample(n=K)
print("Randomly Selected Centroids:")
print(CTs)
plt.scatter(X["ApplicantIncome"], X["LoanAmount"], c='black')
plt.scatter(CTs["ApplicantIncome"], CTs["LoanAmount"], c='red', s=200)
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

diff = 1
j = 0

# Repeat until centroids stop moving (diff == 0)
while diff != 0:
    # XD is just another reference to X (not a real copy)
    XD = X
    i = 1

    # Step 3 - Assign all the points to the closest cluster centroid
    for index1,row_c in CTs.iterrows():
        ED=[] # list to store Euclidean distances for this centroid

        # Compute distance from this centroid to every data point
        for index2,row_d in XD.iterrows():
            d1 = (row_c["ApplicantIncome"] - row_d["ApplicantIncome"])**2
            d2 = (row_c["LoanAmount"] - row_d["LoanAmount"])**2
            d = np.sqrt(d1 + d2)
            ED.append(d)

        # Save the computed distances as a new column in X
        # Each column corresponds to distances from one centroid
        X[i] = ED
        i = i + 1

    # Assign each point to its closest centroid (completing Step 3)
    C = [] # stores cluster assignments
    for index, row in X.iterrows():
        # Start by assuming the first centroid is closest
        min_dist = row[1]
        pos = 1

        # Compare distances to all centroids
        for i in range(K):
            if row[i + 1] < min_dist: # if closer than current min
                min_dist = row[i + 1]
                pos = i + 1 # cluster index (1-based)
        C.append(pos)

    # Add a "Cluster" column to X to mark assignments
    X["Cluster"] = C

    # Step 4 - Recompute centroids of the newly formed clusters
    NCTs = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]

    # On the first iteration, force loop to continue
    if j == 0:
        diff = 1
        j = j + 1
    else:
        # Compute how much centroids have moved
        loan_diff = NCTs['LoanAmount'] - CTs['LoanAmount']
        loan_diff_sum = loan_diff.sum()
        income_diff = NCTs['ApplicantIncome'] - CTs['ApplicantIncome']
        income_diff_sum = income_diff.sum()
        diff = loan_diff_sum + income_diff_sum
        # Print shift in centroids
        print(f"Centroids have moved by {diff}")
        if diff == 0: print("Centroids have stabilized and stopped moving.")

    # Update centroids for the next iteration (NCTs was just computed above)
    CTs = NCTs

# Plot the clusters
color = ['blue','green','cyan']
for k in range(K):
    cluster_points = X[X["Cluster"] == k+1]
    plt.scatter(cluster_points["ApplicantIncome"], cluster_points["LoanAmount"], c=color[k])
plt.scatter(CTs["ApplicantIncome"], CTs["LoanAmount"], c='red', s=200)
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

Randomly Selected Centroids:
     LoanAmount  ApplicantIncome
266       138.0             5829
192        96.0             1625
46         99.0             3029

Centroids have moved by 357.6547686061009
Centroids have moved by 234.65676147361307
Centroids have moved by 241.81286394610947
Centroids have moved by 277.68763984371935
Centroids have moved by 244.66095351174067
Centroids have moved by 229.06905235705375
Centroids have moved by 218.24897861156342
Centroids have moved by 107.07928213052429
Centroids have moved by 52.84741626127729
Centroids have moved by 98.54724443834282
Centroids have moved by 90.64953219227577
Centroids have moved by 18.274686272279013
Centroids have moved by 9.21023994083339
Centroids have moved by 18.345487493007468
Centroids have moved by 46.27013250786139
Centroids have moved by 0.0
Centroids have stabilized and stopped moving.
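
One caveat in the loop above: the stopping test adds up signed centroid shifts, so positive and negative movements could in principle cancel and end the loop before the centroids have truly stabilized. A more robust check for the else branch (a sketch using the same variable names) sums absolute shifts instead:

# Sum of absolute centroid movements; cannot cancel out across clusters
diff = (NCTs - CTs).abs().to_numpy().sum()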

Example 3: Implementing K-Means Clustering using Scikit-Learn

Similar to Example 2, we now implement K-Means Clustering using Scikit-Learn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
data = pd.read_csv('clustering.csv')
data.head()
X = data[["LoanAmount","ApplicantIncome"]]
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

K=3
kmeans = KMeans(n_clusters=K)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c=y_kmeans,cmap='viridis')
centers = kmeans.cluster_centers_
# Columns of X are [LoanAmount, ApplicantIncome], so swap them for (x, y) plotting
plt.scatter(centers[:, 1], centers[:, 0], c='red', s=200, alpha=0.75)
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
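
Because KMeans seeds its centroids randomly, repeated runs can assign different labels (and slightly different clusters). Fixing the seed makes the example reproducible; a minimal tweak, not required by the example:

# Fix the seed (and the number of restarts) so results repeat across runs
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42)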

Example 4: Elbow Method to Determine the Optimal K

Similar to Examples 2 and 3, we now use the Elbow Method to determine the optimal number of clusters (k) for K-Means Clustering instead of pre-defining it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load Data
data = pd.read_csv('clustering.csv')
data.head()

# Plot the Original Data
X = data[["LoanAmount","ApplicantIncome"]]
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

# Calculate WCSS for different values of k
wcss = []
K_range = range(1,10)
for k in K_range:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the WCSS curve and mark the chosen k with a vertical line
plt.plot(K_range, wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster Sum of Squares (WCSS)')
plt.title('Elbow Method For the Optimal k')
plt.axvline(x=3, color='r', linestyle='--', label='Optimal k=3')
plt.legend()
plt.show()

Hands-on Practice

We will be working on a wholesale customer segmentation problem, and the dataset can be downloaded here. The aim of this problem is to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, such as milk, grocery, and frozen products. The dataset is loaded in the following code cell.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data=pd.read_csv("wholesale_customers.csv")
data.head()
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776
2 2 3 6353 8808 7684 2405 3516 7844
3 1 3 13265 1196 4221 6404 507 1788
4 2 3 22615 5410 7198 3915 1777 5185

Q1: Please bring all the variables to the same magnitude using the Standardization scaling technique, and display the head of the scaled dataset. Note that the scaled values should look like the following:

Head of Scaled Data:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 1.448652 0.590668 0.052933 0.523568 -0.041115 -0.589367 -0.043569 -0.066339
1 1.448652 0.590668 -0.391302 0.544458 0.170318 -0.270136 0.086407 0.089151
2 1.448652 0.590668 -0.447029 0.408538 -0.028157 -0.137536 0.133232 2.243293
3 -0.690297 0.590668 0.100111 -0.624020 -0.392977 0.687144 -0.498588 0.093411
4 1.448652 0.590668 0.840239 -0.052396 -0.079356 0.173859 -0.231918 1.299347
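
A possible sketch for Q1, using the data frame loaded above:

from sklearn.preprocessing import StandardScaler

# Bring all columns to zero mean and unit variance
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print("Head of Scaled Data:")
print(scaled_data.head())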

Q2: Use the Elbow Method to determine the optimal number of clusters (k) for K-Means Clustering. The WCSS plot should look like the following:
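
A sketch for Q2, reusing scaled_data from the Q1 sketch (random_state is an assumption for reproducibility):

from sklearn.cluster import KMeans

# Compute WCSS (inertia) for k = 1..9 on the standardized data
wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster Sum of Squares (WCSS)')
plt.show()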

Q3: Use K-Means Clustering to segment the customers into 6 clusters based on their annual spending on diverse product categories. Display the head of the dataset that contains the original features and the cluster assignments. Note that the head of the dataset should look like the following:

Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen Cluster
0 2 3 12669 9656 7561 214 2674 1338 0
1 2 3 7057 9810 9568 1762 3293 1776 0
2 2 3 6353 8808 7684 2405 3516 7844 0
3 1 3 13265 1196 4221 6404 507 1788 1
4 2 3 22615 5410 7198 3915 1777 5185 0
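
One way to produce the table above (a sketch; the cluster numbering can differ between runs unless random_state is fixed):

from sklearn.cluster import KMeans

# Cluster the standardized features, then attach labels to the original rows
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
data["Cluster"] = kmeans.fit_predict(scaled_data)
print(data.head())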