Module 3

Dimensionality Reduction and Clustering

Author

Yike Zhang

Published

September 18, 2025

Class Activities

Week 6

Recap

Link to Google Form

Examples

Example 1: Feature Representation and Scaling

We will use Min-Max scaling and One-Hot encoding from scikit-learn to scale and encode the features of the data provided below.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Example dataset
data = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [50000, 64000, 120000, 110000, 150000],
    "gender": ["M", "F", "F", "M", "M"]
})

print(data)

# Define numeric and categorical features
numeric_features = ["age", "income"]
categorical_features = ["gender"]

# Set the Min-Max scaler for numeric features
numeric_transformer = Pipeline(steps=[("scaler", MinMaxScaler())])

# Drop the first category to avoid redundancy
categorical_transformer = Pipeline(steps=[
        ("onehot", OneHotEncoder(drop="first"))])

# Combine transformations
preprocessor = ColumnTransformer(
        transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)])

# Apply transformation
X_scaled = preprocessor.fit_transform(data)

print("\nTransformed Features:")
print(X_scaled)
   age  income gender
0   25   50000      M
1   32   64000      F
2   47  120000      F
3   51  110000      M
4   62  150000      M

Transformed Features:
[[0.         0.         1.        ]
 [0.18918919 0.14       0.        ]
 [0.59459459 0.7        0.        ]
 [0.7027027  0.6        1.        ]
 [1.         1.         1.        ]]
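
The transformed array drops the original column labels. If you are unsure which output column is which, the fitted preprocessor can report its output feature names in recent scikit-learn versions (1.0+); a minimal sketch:

# Recover output column names from the fitted ColumnTransformer
# (requires a recent scikit-learn; names below are the expected pattern)
print(preprocessor.get_feature_names_out())
# e.g. ['num__age' 'num__income' 'cat__gender_M']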

Example 2: Use PCA to Extract the Secret Message

We plot the first two PCA dimensions to extract a hidden message from a CSV file. The data file can be downloaded here. The dataset has 491 rows and 10 columns in total.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

df = pd.read_csv("DimensionReduction.csv")

# Plot the first two PCA dimensions
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df.values)

# Make a 2D scatter plot
plt.figure(figsize=(8,6))
plt.scatter(pca_result[:,0], pca_result[:,1], alpha=1)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.axis("equal")
plt.grid(True)
plt.show()
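
If you want to check how much of the total variance the two components retain, the fitted PCA object exposes this directly; a short sketch:

# Fraction of variance explained by each of the two components
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")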

Hands-on Practice

We will use the simplified Housing dataset to practice feature scaling. The simplified dataset has 5 rows and 6 columns in total.
import pandas as pd
import numpy as np

# Load the original Housing dataset
original_df = pd.read_csv('housing.csv')
# Use only the first 5 rows for simplicity
simplified_df = original_df.iloc[:5, :]

# Keep only numeric columns for scaling
simplified_df = simplified_df.select_dtypes(include=np.number)
simplified_df
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
0 79545.45857 5.682861 7.009188 4.09 23086.80050 1.059034e+06
1 79248.64245 6.002900 6.730821 3.09 40173.07217 1.505891e+06
2 61287.06718 5.865890 8.512727 5.13 36882.15940 1.058988e+06
3 63345.24005 7.188236 5.586729 3.26 34310.24283 1.260617e+06
4 59982.19723 5.040555 7.839388 4.23 26354.10947 6.309435e+05

Q1: Use Min-Max scaling to scale the features in the dataset, and display the scaled dataset. Note that the scaled values should look like the following:

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
0 1.000000 0.299070 0.486145 0.490196 0.000000 0.489275
1 0.984828 0.448086 0.391009 0.000000 1.000000 1.000000
2 0.066700 0.384291 1.000000 1.000000 0.807394 0.489223
3 0.171906 1.000000 0.000000 0.083333 0.656869 0.719670
4 0.000000 0.000000 0.769877 0.558824 0.191224 0.000000
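
One possible sketch for Q1, assuming simplified_df from the cell above is still in scope:

from sklearn.preprocessing import MinMaxScaler

# Scale every column to the [0, 1] range and keep the original labels
scaler = MinMaxScaler()
minmax_df = pd.DataFrame(scaler.fit_transform(simplified_df),
                         columns=simplified_df.columns)
print(minmax_df)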

Q2: Use Standardization scaling to scale the features in the dataset, and display the scaled dataset. Note that the scaled values should look like the following:

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
0 1.232415 -0.391014 -0.126956 0.176724 -1.409777 -0.153146
1 1.198743 0.066992 -0.406144 -1.182694 1.244683 1.400031
2 -0.838872 -0.129083 1.381019 1.590520 0.733419 -0.153305
3 -0.605386 1.763319 -1.553612 -0.951593 0.333855 0.547513
4 -0.986900 -1.310215 0.705693 0.367043 -0.902180 -1.641093
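
A matching sketch for Q2, again assuming simplified_df is in scope (note that StandardScaler uses the population standard deviation):

from sklearn.preprocessing import StandardScaler

# Center each column at zero mean and scale to unit variance
scaler = StandardScaler()
standardized_df = pd.DataFrame(scaler.fit_transform(simplified_df),
                               columns=simplified_df.columns)
print(standardized_df)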

Q3: First, we will scale the original Housing dataset using the Standardization (Z-score) scaling technique. After that, we define the column “Price” as the target variable, while all other columns serve as the features. Please print out the head of the dataset that only contains features (the first 5 rows). We will then use PCA to reduce the dimensionality of the feature dataset to two dimensions and plot it.

import pandas as pd
import numpy as np
original_df = pd.read_csv('housing.csv')
original_df = original_df.select_dtypes(include=np.number)

Print out the head of the dataset that only contains features after Standardization scaling:

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population
0 1.028660 -0.296927 0.021274 0.088062 -1.317599
1 1.000808 0.025902 -0.255506 -0.722301 0.403999
2 -0.684629 -0.112303 1.516243 0.930840 0.072410
3 -0.491499 1.221572 -1.393077 -0.584540 -0.186734
4 -0.807073 -0.944834 0.846742 0.201513 -0.988387

The resulting 2D PCA plot should look like the following:
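
One way to approach Q3, continuing from the loading snippet above (a sketch, not the only solution):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Standardize all numeric columns, then split features from the target
scaled = pd.DataFrame(StandardScaler().fit_transform(original_df),
                      columns=original_df.columns)
X = scaled.drop(columns=["Price"])
y = scaled["Price"]
print(X.head())

# Reduce the five features to two principal components and plot them
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()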

Q4: Use PCA to train a linear regression model on the standardized dataset from Q3 that contains the features and the target variable. The train and test split ratio is 80% and 20%, respectively. Report the MSE score for model evaluation at the end.

MSE result: 0.856107037084032
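
A sketch for Q4, reusing X_pca and y from the Q3 sketch above; the exact MSE depends on how the split is seeded (random_state=42 is an assumption here):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 80/20 train/test split on the two PCA components
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test)))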

Week 7

Recap

Link to Google Form

Examples

Example 1: Interactive K-Means Clustering Plot

Below is an interactive k-means clustering visualization with animated iterations and a Voronoi overlay. You can adjust the number of clusters using the slider. The data points are loaded from a CSV file hosted on GitHub, and the centroids are randomly initialized. The visualization updates the clusters and centroids over 10 iterations.

You pick the number of clusters k, then the plot:

  • randomly seeds k centroids,
  • repeatedly assigns each point to the nearest centroid,
  • updates each centroid to the mean of its assigned points,
  • animates those changes over ~10 iterations,
  • shows a Voronoi partition (the colored regions), the data points, and the centroid markers.

Example 2: Implementing K-Means Clustering from Scratch

In this example, we implement K-Means Clustering from scratch using Python. The dataset we use can be downloaded here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Fix seeds so the randomly chosen starting centroids are reproducible
random.seed(42)
np.random.seed(42)

data = pd.read_csv('clustering.csv')
data.head()

X = data[["LoanAmount","ApplicantIncome"]].copy()
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

# Step 1 - Choose the number of clusters (k) 
# Number of clusters
K = 3

# Step 2 - Select random centroid for each cluster
# Select random observation as centroids
CTs = X.sample(n=K)
print("Randomly Selected Centroids:")
print(CTs)
plt.scatter(X["ApplicantIncome"], X["LoanAmount"], c='black')
plt.scatter(CTs["ApplicantIncome"], CTs["LoanAmount"], c='red', s=200)
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

diff = 1
j = 0

# Repeat until centroids stop moving (diff == 0)
while diff != 0:
    # XD is just another reference to X (not a real copy)
    XD = X
    i = 1

    # Step 3 - Assign all the points to the closest cluster centroid
    for index1,row_c in CTs.iterrows():
        ED=[] # list to store Euclidean distances for this centroid

        # Compute distance from this centroid to every data point
        for index2,row_d in XD.iterrows():
            d1 = (row_c["ApplicantIncome"] - row_d["ApplicantIncome"])**2
            d2 = (row_c["LoanAmount"] - row_d["LoanAmount"])**2
            d = np.sqrt(d1 + d2)
            ED.append(d)

        # Save the computed distances as a new column in X
        # Each column corresponds to distances from one centroid
        X[i] = ED
        i = i + 1

    # Assign each point to its closest centroid (completing Step 3)
    C = [] # stores cluster assignments
    for index, row in X.iterrows():
        # Start by assuming the first centroid is closest
        min_dist = row[1]
        pos = 1

        # Compare distances to all centroids
        for i in range(K):
            if row[i + 1] < min_dist: # if closer than current min
                min_dist = row[i + 1]
                pos = i + 1 # cluster index (1-based)
        C.append(pos)

    # Add a "Cluster" column to X to mark assignments
    X["Cluster"] = C

    # Step 4 - Recompute centroids of the newly formed clusters
    NCTs = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]

    # On the first iteration, force loop to continue
    if j == 0:
        diff = 1
        j = j + 1
    else:
        # Compute how much centroids have moved
        loan_diff = NCTs['LoanAmount'] - CTs['LoanAmount']
        loan_diff_sum = loan_diff.sum()
        income_diff = NCTs['ApplicantIncome'] - CTs['ApplicantIncome']
        income_diff_sum = income_diff.sum()
        diff = loan_diff_sum + income_diff_sum
        # Print shift in centroids
        print(f"Centroids have moved by {diff}")
        if diff == 0: print("Centroids have stabilized and stopped moving.")

    # Update centroids for the next iteration (NCTs was just computed above)
    CTs = NCTs

# Plot the clusters
color = ['blue','green','cyan']
for k in range(K):
    cluster_points = X[X["Cluster"] == k+1]
    plt.scatter(cluster_points["ApplicantIncome"], cluster_points["LoanAmount"], c=color[k])
plt.scatter(CTs["ApplicantIncome"], CTs["LoanAmount"], c='red', s=200)
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

Randomly Selected Centroids:
     LoanAmount  ApplicantIncome
266       138.0             5829
192        96.0             1625
46         99.0             3029

Centroids have moved by 357.6547686061009
Centroids have moved by 234.65676147361307
Centroids have moved by 241.81286394610947
Centroids have moved by 277.68763984371935
Centroids have moved by 244.66095351174067
Centroids have moved by 229.06905235705375
Centroids have moved by 218.24897861156342
Centroids have moved by 107.07928213052429
Centroids have moved by 52.84741626127729
Centroids have moved by 98.54724443834282
Centroids have moved by 90.64953219227577
Centroids have moved by 18.274686272279013
Centroids have moved by 9.21023994083339
Centroids have moved by 18.345487493007468
Centroids have moved by 46.27013250786139
Centroids have moved by 0.0
Centroids have stabilized and stopped moving.
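
One caveat in the loop above: the stopping test adds up signed centroid shifts, so positive and negative movements could in principle cancel and end the loop before the centroids have truly stabilized. A more robust check for the else branch (a sketch using the same variable names) sums absolute shifts instead:

# Sum of absolute centroid movements; cannot cancel out across clusters
diff = (NCTs - CTs).abs().to_numpy().sum()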

Example 3: Implementing K-Means Clustering using Scikit-Learn

Similar to Example 2, we now implement K-Means Clustering using Scikit-Learn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
data = pd.read_csv('clustering.csv')
data.head()
X = data[["LoanAmount","ApplicantIncome"]]
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

K=3
kmeans = KMeans(n_clusters=K)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c=y_kmeans,cmap='viridis')
centers = kmeans.cluster_centers_
# Columns of X are [LoanAmount, ApplicantIncome], so swap them for (x, y) plotting
plt.scatter(centers[:, 1], centers[:, 0], c='red', s=200, alpha=0.75)
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
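
Because KMeans seeds its centroids randomly, repeated runs can assign different labels (and slightly different clusters). Fixing the seed makes the example reproducible; a minimal tweak, not required by the example:

# Fix the seed (and the number of restarts) so results repeat across runs
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42)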

Example 4: Elbow Method to Determine the Optimal K

Similar to Examples 2 and 3, we now use the Elbow Method to determine the optimal number of clusters (k) for K-Means Clustering instead of pre-defining it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load Data
data = pd.read_csv('clustering.csv')
data.head()

# Plot the Original Data
X = data[["LoanAmount","ApplicantIncome"]]
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

# Calculate WCSS for different values of k
wcss = []
K_range = range(1,10)
for k in K_range:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the WCSS curve and mark the chosen k with a vertical line
plt.plot(K_range, wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster Sum of Squares (WCSS)')
plt.title('Elbow Method For the Optimal k')
plt.axvline(x=3, color='r', linestyle='--', label='Optimal k=3')
plt.legend()
plt.show()

Hands-on Practice

We will be working on a wholesale customer segmentation problem, and the dataset can be downloaded here. The aim of this problem is to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, such as milk, grocery, and frozen products. The dataset is loaded in the following code cell.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data=pd.read_csv("wholesale_customers.csv")
data.head()
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776
2 2 3 6353 8808 7684 2405 3516 7844
3 1 3 13265 1196 4221 6404 507 1788
4 2 3 22615 5410 7198 3915 1777 5185

Q1: Please bring all the variables to the same magnitude using the Standardization scaling technique, and display the head of the scaled dataset. Note that the scaled values should look like the following:

Head of Scaled Data:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 1.448652 0.590668 0.052933 0.523568 -0.041115 -0.589367 -0.043569 -0.066339
1 1.448652 0.590668 -0.391302 0.544458 0.170318 -0.270136 0.086407 0.089151
2 1.448652 0.590668 -0.447029 0.408538 -0.028157 -0.137536 0.133232 2.243293
3 -0.690297 0.590668 0.100111 -0.624020 -0.392977 0.687144 -0.498588 0.093411
4 1.448652 0.590668 0.840239 -0.052396 -0.079356 0.173859 -0.231918 1.299347
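
A possible sketch for Q1, using the data frame loaded above:

from sklearn.preprocessing import StandardScaler

# Bring all columns to zero mean and unit variance
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print("Head of Scaled Data:")
print(scaled_data.head())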

Q2: Use the Elbow Method to determine the optimal number of clusters (k) for K-Means Clustering. The WCSS plot should look like the following:
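
A sketch for Q2, reusing scaled_data from the Q1 sketch (random_state is an assumption for reproducibility):

from sklearn.cluster import KMeans

# Compute WCSS (inertia) for k = 1..9 on the standardized data
wcss = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster Sum of Squares (WCSS)')
plt.show()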

Q3: Use K-Means Clustering to segment the customers into 6 clusters based on their annual spending on diverse product categories. Display the head of the dataset that contains the original features and the cluster assignments. Note that the head of the dataset should look like the following:

Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen Cluster
0 2 3 12669 9656 7561 214 2674 1338 0
1 2 3 7057 9810 9568 1762 3293 1776 0
2 2 3 6353 8808 7684 2405 3516 7844 0
3 1 3 13265 1196 4221 6404 507 1788 1
4 2 3 22615 5410 7198 3915 1777 5185 0
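
One way to produce the table above (a sketch; the cluster numbering can differ between runs unless random_state is fixed):

from sklearn.cluster import KMeans

# Cluster the standardized features, then attach labels to the original rows
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
data["Cluster"] = kmeans.fit_predict(scaled_data)
print(data.head())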