import math

# probability of class "buy"
p_buy = 0.40
# probability of class "not buy"
p_not_buy = 0.60

# calculate entropy
entropy = -(p_buy * math.log2(p_buy) + p_not_buy * math.log2(p_not_buy))

# print out the result
print(entropy)
0.9709505944546686
💡 Using the SciPy library in Python
from scipy.stats import entropy

# probability of class "buy"
p_buy = 0.40
# probability of class "not buy"
p_not_buy = 0.60

probabilities = [p_buy, p_not_buy]

# calculate entropy with a base-2 logarithm; store the result under a new name
# so the imported entropy function is not shadowed
entropy_value = entropy(probabilities, base=2)
print(entropy_value)
The example below uses the KNN Classifier to classify the Iris dataset. We will visualize the decision boundary with different values of k (number of closest neighbors).
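That Iris example is not reproduced in this section, so a minimal sketch of what it could look like is given here. It assumes scikit-learn and matplotlib, uses KNeighborsClassifier on the first two Iris features, and colours a grid of points to show how the boundary changes with k; the particular values of k tried are an assumption, not taken from the original example.

# A minimal sketch (not the original example) of KNN on the Iris dataset,
# visualizing the decision boundary for several values of k.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, :2]  # keep only the first two features so the boundary can be drawn in 2D
y = iris.target

for k in [1, 5, 15]:  # assumed values of k to compare
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)

    # Build a grid covering the feature space and predict the class at every grid point
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.figure()
    plt.contourf(xx, yy, Z, alpha=0.3)                 # coloured decision regions
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k') # training points
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.title(f"KNN decision boundary (k={k})")
plt.show()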
Example 4: Implementing KNN from Scratch in Python
Given a set of training data points and their corresponding labels, the function knn_predict predicts the label for a new test point based on the majority label of its k nearest neighbors. We use the Counter class from the collections module to find the most common label among the neighbors.
import numpy as np
from collections import Counter

# Defining the Euclidean distance function:
# euclidean_distance calculates the Euclidean distance between two points.
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2))**2))

def knn_predict(training_data, training_labels, test_point, k):
    distances = []
    for i in range(len(training_data)):
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    # Sort by distance and keep the labels of the k closest points
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    # Return the most common label among the k nearest neighbors
    return Counter(k_nearest_labels).most_common(1)[0][0]

training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k = 3  # Number of the closest neighbors to consider

prediction = knn_predict(training_data, training_labels, test_point, k)
print(f"The predicted class for the test point is: {prediction}")
The predicted class for the test point is: A
Plot the decision boundary of the KNN classifier implemented from scratch.
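One possible way to do this is sketched below. It reuses knn_predict, training_data, training_labels, test_point, and k from the example above; matplotlib and the grid-evaluation approach are assumptions of this sketch, not part of the original code.

# Sketch: evaluate knn_predict on a grid of points and colour the resulting regions.
import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.arange(0, 9, 0.1), np.arange(1, 10, 0.1))

# Predict a class for every grid point with the from-scratch classifier
grid_labels = np.array([
    knn_predict(training_data, training_labels, [x, y], k)
    for x, y in zip(xx.ravel(), yy.ravel())
])
# Map the string labels to integers so contourf can colour the regions
Z = np.where(grid_labels == 'A', 0, 1).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
for label, marker in [('A', 'o'), ('B', 's')]:
    pts = np.array([p for p, l in zip(training_data, training_labels) if l == label])
    plt.scatter(pts[:, 0], pts[:, 1], marker=marker, label=f"class {label}")
plt.scatter(*test_point, c='red', marker='*', s=200, label='test point')
plt.legend()
plt.title(f"Decision boundary of the from-scratch KNN (k={k})")
plt.show()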
Hands-on Practice
Q1: Will you like a movie? Please classify the following movies using the Decision Tree Classifier taught in class.
Movie | Type   | Length | IMDb Rating | Liked?
------|--------|--------|-------------|-------
m1    | Comedy | Short  | 7.2         | Yes
m2    | Drama  | Medium | 9.3         | Yes
m3    | Comedy | Medium | 5.1         | No
m4    | Drama  | Long   | 6.9         | No
m5    | Drama  | Medium | 8.3         | Yes
m6    | Drama  | Short  | 4.5         | No
m7    | Comedy | Short  | 8.0         | Yes
m8    | Drama  | Medium | 7.5         | Yes
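If you want to check your hand-built tree, one option (a sketch, not the required method) is to fit scikit-learn's DecisionTreeClassifier with the entropy criterion on the same eight movies. The one-hot encoding of Type and Length via pandas.get_dummies is an assumption of this sketch, not something prescribed by the question.

# Sketch: verify the hand-built decision tree with scikit-learn (entropy criterion).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

movies = pd.DataFrame({
    'Type':   ['Comedy', 'Drama', 'Comedy', 'Drama', 'Drama', 'Drama', 'Comedy', 'Drama'],
    'Length': ['Short', 'Medium', 'Medium', 'Long', 'Medium', 'Short', 'Short', 'Medium'],
    'IMDb':   [7.2, 9.3, 5.1, 6.9, 8.3, 4.5, 8.0, 7.5],
    'Liked':  ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
})

# One-hot encode the categorical attributes; the numeric IMDb rating passes through unchanged
X = pd.get_dummies(movies[['Type', 'Length', 'IMDb']])
y = movies['Liked']

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text view of the learned splits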
Q2: Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only their height and weight. The data, including height, weight and T-shirt size, is shown in the table below. A new customer named 'Monica' has height 161 cm and weight 61 kg. Can you use the KNN Classifier to predict Monica's T-shirt size? k is set to 5 in this example.
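Since the customer table itself is not reproduced in this section, the sketch below only illustrates the call pattern with scikit-learn's KNeighborsClassifier; the height, weight and size arrays are placeholders, not the data from the question, and must be replaced with the actual table before the prediction means anything.

# Sketch only: the arrays below are PLACEHOLDERS, not the customer table from the question.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[158, 58], [160, 60], [163, 61], [165, 64], [168, 66]])  # placeholder [height_cm, weight_kg]
y = np.array(['M', 'M', 'M', 'L', 'L'])                                # placeholder T-shirt sizes

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 as stated in the question
knn.fit(X, y)
print(knn.predict([[161, 61]]))  # Monica: height 161 cm, weight 61 kg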
The example below uses the SVM Classifier to classify the Iris dataset (setosa or versicolor). We will visualize the SVM decision boundary in the plot.
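That example is likewise not included here, so a minimal sketch is given instead. It assumes scikit-learn's SVC with a linear kernel, restricted to the setosa and versicolor classes and to the first two Iris features so the boundary can be drawn in 2D; none of these choices are taken from the original example.

# A minimal sketch of an SVM decision boundary for setosa vs versicolor on two Iris features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target < 2              # keep only setosa (0) and versicolor (1)
X = iris.data[mask][:, :2]          # first two features for a 2D plot
y = iris.target[mask]

svm = SVC(kernel='linear')
svm.fit(X, y)

# Evaluate the decision function on a grid and draw its zero-level contour (the boundary)
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contour(xx, yy, Z, levels=[0], colors='k')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("SVM decision boundary: setosa vs versicolor")
plt.show()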
Q1: Consider the car theft problem with attributes Color, Type, and Origin, and the target Stolen, which can be either Yes or No. Please use the Naive Bayes Classifier to solve the following problems.
Car Theft Problem

Example No. | Color  | Type   | Origin   | Stolen?
------------|--------|--------|----------|--------
1           | Red    | Sports | Domestic | Yes
2           | Red    | Sports | Domestic | No
3           | Red    | Sports | Domestic | Yes
4           | Yellow | Sports | Domestic | No
5           | Yellow | Sports | Imported | Yes
6           | Yellow | SUV    | Imported | No
7           | Yellow | SUV    | Imported | Yes
8           | Yellow | SUV    | Domestic | No
9           | Red    | SUV    | Imported | No
10          | Red    | Sports | Imported | Yes
Q1.1: What is the probability of a Red Domestic SUV being stolen?
Q1.2: What is the probability of a Red Domestic SUV not being stolen?
Q1.3: Given these probabilities, do you think the Red Domestic SUV will be stolen or not?
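One way to check the hand calculation (a sketch, not part of the assignment) is to count the class-conditional frequencies directly with pandas and multiply them, which mirrors the Naive Bayes product used in class: P(Yes | Red, SUV, Domestic) is proportional to P(Red|Yes) P(SUV|Yes) P(Domestic|Yes) P(Yes), and likewise for No.

# Sketch: reproduce the Naive Bayes hand calculation by counting frequencies with pandas.
import pandas as pd

cars = pd.DataFrame({
    'Color':  ['Red', 'Red', 'Red', 'Yellow', 'Yellow', 'Yellow', 'Yellow', 'Yellow', 'Red', 'Red'],
    'Type':   ['Sports', 'Sports', 'Sports', 'Sports', 'Sports', 'SUV', 'SUV', 'SUV', 'SUV', 'Sports'],
    'Origin': ['Domestic', 'Domestic', 'Domestic', 'Domestic', 'Imported', 'Imported', 'Imported',
               'Domestic', 'Imported', 'Imported'],
    'Stolen': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
})

query = {'Color': 'Red', 'Type': 'SUV', 'Origin': 'Domestic'}

for label in ['Yes', 'No']:
    subset = cars[cars['Stolen'] == label]
    prior = len(subset) / len(cars)            # P(Stolen = label)
    likelihood = 1.0
    for attr, value in query.items():
        likelihood *= (subset[attr] == value).mean()  # P(attribute = value | Stolen = label)
    print(f"Score for Stolen = {label}: {prior * likelihood:.4f}")

The two printed values are unnormalized scores; the predicted class is the one with the larger score, and dividing each score by their sum gives posterior probabilities if those are wanted.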
Q2: Could you use a Support Vector Machine (SVM) to predict a Pulsar Star? The dataset can be downloaded here. Pulsars are a rare type of Neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter. Classification algorithms in particular are being adopted, which treat the data sets as binary classification problems. Here the legitimate pulsar examples form the minority positive class and the spurious examples form the majority negative class. The dataset contains 16,259 spurious examples caused by RFI/noise and 1,639 real pulsar examples. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).
Attribute Information: Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile. The remaining four variables are similarly obtained from the DM-SNR curve. These are summarised below:
Mean of the integrated profile.
Standard deviation of the integrated profile.
Excess kurtosis of the integrated profile.
Skewness of the integrated profile.
Mean of the DM-SNR curve.
Standard deviation of the DM-SNR curve.
Excess kurtosis of the DM-SNR curve.
Skewness of the DM-SNR curve.
Class
Example code is provided below for reference (it contains the data preprocessing). The code below only prepares the data; you need to add code for SVM model training and prediction after the data preparation part. We aim to run SVM with default hyperparameters, and the model prediction accuracy should be 0.9827 at the end. Note that if you do not have a Python IDE set up, you can run the following code in Google Colab.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split  # Split the dataset into training and testing sets
from sklearn.preprocessing import StandardScaler  # Feature scaling

df = pd.read_csv("pulsar_stars.csv")
print(f"Dataset shape: {df.shape}")

# let's preview the dataset
# print(df.head(5))
print(f"Column names: {df.columns.str.strip()}")

# view summary of dataset
# print(f"Dataset summary:\n{df.describe()}")

X = df.drop(['target_class'], axis=1)  # Drop specified labels from rows or columns.
y = df['target_class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(f"Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Construct DataFrames
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

# After data preparation, we now run SVM with default hyperparameters
# Code for SVM model training and evaluation goes below ⬇️
Dataset shape: (17898, 9)
Column names: Index(['Mean of the integrated profile',
'Standard deviation of the integrated profile',
'Excess kurtosis of the integrated profile',
'Skewness of the integrated profile', 'Mean of the DM-SNR curve',
'Standard deviation of the DM-SNR curve',
'Excess kurtosis of the DM-SNR curve', 'Skewness of the DM-SNR curve',
'target_class'],
dtype='object')
Training set shape: (14318, 8), Test set shape: (3580, 8)
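A sketch of the missing training and evaluation step is shown below. It assumes scikit-learn's SVC with its default hyperparameters and accuracy_score for evaluation, reusing X_train, y_train, X_test and y_test from the preparation code above.

# Sketch: train an SVC with default hyperparameters on the scaled features
# and report accuracy on the held-out test set.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svc = SVC()                 # default hyperparameters: RBF kernel, C=1.0
svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)
print(f"Model accuracy score with default hyperparameters: {accuracy_score(y_test, y_pred):.4f}")

If the data preparation above was run unchanged, the printed accuracy should match the 0.9827 figure quoted earlier.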