k-means clustering

In machine learning, we are not always given an objective to optimize or a target label to assign to each input data point. Problems of this kind, where no objective or label is provided, are called unsupervised learning problems. In an unsupervised learning problem, we try to model the latent structure present within the data. Clustering is a type of unsupervised learning in which we group similar data points into cohorts (clusters) based on their underlying structure. K-means is a widely used clustering algorithm, where K is the number of clusters into which we partition the data points.

K-Means Pseudocode

# K-Means Clustering
1. Choose the number of clusters (K) and obtain the data points
2. Place the centroids c_1, c_2, ..., c_k randomly
3. Repeat steps 4 and 5 until convergence or for a fixed number of iterations
4. for each data point x_i:
   - find the nearest centroid (c_1, c_2, ..., c_k)
   - assign the point to that cluster
5. for each cluster j = 1..k:
   - new centroid = mean of all points assigned to that cluster
6. End
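
To make these steps concrete, here is a minimal, self-contained NumPy sketch of the same loop (the data matrix X, the value of k, and the iteration cap are assumptions for illustration); the from-scratch walkthrough below spells out the same logic step by step on the Iris data.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 4: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 3: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels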

How to Choose K?

There are cases where we do not know the number of clusters in advance. How do we choose a value for K then? One common approach is the elbow method: run the algorithm with different numbers of clusters and plot the total within-cluster distance to the centroids against K. The resulting graph looks like the one below.

From the above graph we can infer that at K = 4 the within-cluster distance has already dropped to a low value and the curve starts to flatten. Even though the within-cluster distance keeps decreasing beyond 4, the extra clusters cost more computation for only a marginal improvement, which is analogous to the law of diminishing returns. We therefore choose 4 as the optimum number of clusters. The reason it is named the elbow method is that the optimum number of clusters sits at the bend of the curve, which resembles an elbow joint!
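
As a rough sketch of how such an elbow plot can be produced, the snippet below fits scikit-learn's KMeans for a range of K values and plots the within-cluster sum of squared distances (exposed as the inertia_ attribute); the data matrix X and the range of K values are placeholders.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, max_k=10):
    # Fit K-Means for each candidate K and record the within-cluster
    # sum of squared distances (inertia)
    ks = range(1, max_k + 1)
    inertias = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)
    # The "elbow" of this curve suggests a good choice of K
    plt.plot(ks, inertias, marker='o')
    plt.xlabel('Number of clusters (K)')
    plt.ylabel('Within-cluster sum of squares')
    plt.show()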

Implementation from scratch

We will be using the Iris dataset to build our algorithm. Even though the Iris dataset has labels, we will drop them and use only the feature points to cluster the data. We know that there are 3 classes ('Iris-virginica', 'Iris-setosa', 'Iris-versicolor'), so k = 3 in our case.

import pandas as pd
import numpy as np
from sklearn.utils import shuffle

# Load the Iris dataset (file path assumed) and shuffle the rows
df = pd.read_csv('Iris.csv')
df = shuffle(df, random_state=42)
# Store the target value
classes = df['Species']
# Drop the label so only the four feature columns remain
# (drop an 'Id' column as well if your CSV has one)
df = df.drop(columns=['Species'])
# Convert the dataframe into a list and then into a numpy array
data = df.values.tolist()
data = np.array(data)
# First 135 points are used for training and the rest is used for testing
train_data = data[:135]
test_data = data[135:]

We load the dataset, shuffle it, and drop the target values. We then convert the feature points into a numpy array and split them into training and testing data.

# Randomly place the centroids of the three clusters
c1 = [float(np.random.randint(4,8)), float(np.random.randint(1,5)),
      float(np.random.randint(1,7)), float(np.random.randint(0,3))]
c2 = [float(np.random.randint(4,8)), float(np.random.randint(1,5)),
      float(np.random.randint(1,7)), float(np.random.randint(0,3))]
c3 = [float(np.random.randint(4,8)), float(np.random.randint(1,5)),
      float(np.random.randint(1,7)), float(np.random.randint(0,3))]

epochs = 1
while True:
    cluster_1, cluster_2, cluster_3 = [], [], []
    for point in train_data:
        # Find the Euclidean distance between the point and each centroid
        dis_point_c1 = ((c1[0]-point[0])**2 + (c1[1]-point[1])**2 +
                        (c1[2]-point[2])**2 + (c1[3]-point[3])**2)**0.5
        dis_point_c2 = ((c2[0]-point[0])**2 + (c2[1]-point[1])**2 +
                        (c2[2]-point[2])**2 + (c2[3]-point[3])**2)**0.5
        dis_point_c3 = ((c3[0]-point[0])**2 + (c3[1]-point[1])**2 +
                        (c3[2]-point[2])**2 + (c3[3]-point[3])**2)**0.5
        distances = [dis_point_c1, dis_point_c2, dis_point_c3]
        # Assign the point to the cluster with the nearest centroid
        pos = distances.index(min(distances))
        if pos == 0:
            cluster_1.append(point)
        elif pos == 1:
            cluster_2.append(point)
        else:
            cluster_3.append(point)
    # Store the centroid values to compare against the new centroid values
    prev_c1 = c1
    prev_c2 = c2
    prev_c3 = c3
    cluster_1 = np.array(cluster_1)
    cluster_2 = np.array(cluster_2)
    cluster_3 = np.array(cluster_3)
    # New centroid = mean of all points assigned to that cluster
    if len(cluster_1) > 0:
        c1 = list(cluster_1.mean(axis=0))
    if len(cluster_2) > 0:
        c2 = list(cluster_2.mean(axis=0))
    if len(cluster_3) > 0:
        c3 = list(cluster_3.mean(axis=0))
    # If the centroid values haven't changed, the algorithm has converged
    if prev_c1 == c1 and prev_c2 == c2 and prev_c3 == c3:
        print("Converged")
        break
    print(epochs)
    epochs += 1

We implement the pseudocode shown above and find that our algorithm converges after 6 iterations. We can now take a test data point, find the centroid it is closest to, and assign the point to that cluster.

pred = []
centroids = np.array([c1, c2, c3])
for point in test_data:
    # Find the cluster whose centroid the point is closest to and append it
    distances = np.linalg.norm(centroids - point, axis=1)
    pred.append(int(np.argmin(distances)))
# Print the predictions
print(pred)

Scikit-learn implementation

The Scikit-learn library once again saves us from writing many lines of code: it provides a ready-made KMeans estimator that we can use to run the algorithm directly.

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=3)
clf.fit(train_data)
pred = clf.predict(test_data)
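
Beyond the predictions, the fitted estimator exposes the learned clustering directly; for example (these attributes are part of scikit-learn's KMeans API):

# Coordinates of the three learned centroids
print(clf.cluster_centers_)
# Cluster index assigned to each training point
print(clf.labels_)
# Within-cluster sum of squared distances (the quantity used by the elbow method)
print(clf.inertia_)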