Implementation Of K-Means Clustering In Python

Introduction to clustering

Clustering is an unsupervised task in machine learning, which means the data fed to the algorithm is unlabeled. Clustering is the process of grouping data into different groups that share similar characteristics or properties. A cluster is a group of data points that share similar properties with one another but differ from the points in other clusters. The k-means clustering algorithm is a popular algorithm for clustering data points.

Examples of clustering

Some examples of clustering are as follows:

  • Customer segmentation having a similar purchase history
  • Image segmentation
  • Vegetables arranged in the shop
  • Items arranged in a shopping mall
  • Spam filtering and so on

K-means Clustering

K-means clustering is a centroid-based clustering technique that groups similar data into different groups. Here, K represents the predefined number of clusters to be determined from the unlabeled data. If k = 3, the data will be clustered into three clusters, and if k = 5, the data will be clustered into five clusters.

Algorithm

  1. Define the value of k, i.e., the number of clusters to be determined
  2. Randomly select k data points as the initial centroids
  3. Measure the distance between each point and each centroid
  4. Assign each point to the nearest centroid's cluster
  5. Calculate the mean of each cluster and update its centroid
  6. Repeat steps 3-5 until the centroids no longer change
  7. Stop.
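The steps above can be sketched in plain Python with NumPy. This is a minimal from-scratch illustration, not the scikit-learn implementation used later in this tutorial; the function name and defaults are my own:

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k distinct data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3-4: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of the points assigned to it
        # (if a cluster is empty, keep its old centroid)
        new_centroids = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Scikit-learn's KMeans, used later in this tutorial, adds refinements on top of this basic loop, such as k-means++ initialization and multiple restarts.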

Working mechanism of k-means clustering

Step 1

Let’s understand how k-means clustering works step by step with the necessary figures. The figure below shows a scatterplot of the input data that needs to be clustered.

data for clustering

Step 2

Let’s suppose that we need to cluster these data into two clusters. Hence, we take k = 2, and after randomly selecting the two centroids, the data looks like the figure below

assign centroid for clustering

The yellow stars represent the two centroids. So far, we’ve plotted the data in the scatterplot and assigned the two centroids randomly.

Step 3

Now, the next task is to measure the distance between each data point and the two centroids, which are separated by the median line shown in red. The distance is calculated using the Euclidean distance, given by

d = √((x2 − x1)² + (y2 − y1)²)
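As a small illustration, the Euclidean distance between a data point (x1, y1) and a centroid (x2, y2) can be computed with NumPy (the example values are made up):

```python
import numpy as np

point = np.array([2.0, 3.0])      # a data point (x1, y1)
centroid = np.array([5.0, 7.0])   # a centroid (x2, y2)

# Euclidean distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
distance = np.sqrt(np.sum((point - centroid) ** 2))
print(distance)  # 5.0
```

The same result can be obtained with `np.linalg.norm(point - centroid)`.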

 

Step 4

After computing the distance between the data points and each centroid, every point is assigned to its nearest centroid, forming the k predetermined clusters. The following shows the data clustered into two clusters

assign data to cluster

Step 5

After clustering the data, the mean of each cluster is calculated and each centroid is shifted to a new position so that the sum of the distances between the data points and their centroid is minimized.
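The centroid update in this step is simply the coordinate-wise mean of the points currently assigned to the cluster. A small sketch with made-up points:

```python
import numpy as np

# hypothetical points currently assigned to one cluster
cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# new centroid = mean of the cluster's points along each coordinate
new_centroid = cluster_points.mean(axis=0)
print(new_centroid)  # [3. 4.]
```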

mean calculation to change centroid

Step 6

After shifting the centroids to their new positions, we again calculate the distance between the points and the centroids and assign each data point to the nearest cluster. We can see in the figure above that two data points that previously belonged to the purple cluster are now part of the green cluster because of the change in centroids.

assign data to new centroid

After that, those two data points belong to the green cluster.

The algorithm will again calculate the mean of each cluster and check whether the centroids have shifted. If a centroid has changed, the algorithm repeats all the steps above until the centroids no longer change.

change centroid for clustering

As we can see, the centroids have moved to new positions. We again calculate the distance between the data points and the centroids and assign each point to the nearest cluster. After this assignment, no data point changes its cluster.

data assign to new centroid

Step 7

The figure above shows that there is no shift in the centroid positions compared with their previous positions. The figure below shows the final result of the clustering

clustering of data points

Hence, this is the final result of the k-means clustering algorithm.

Selection of best k for clustering

The performance of k-means clustering depends on the value of k, i.e., the number of clusters, so we should select it carefully. There are various ways of selecting the value of k, but in this tutorial we’ll use the Elbow method to determine the best k for our model.

Elbow method

This method computes the value of WCSS (Within-Cluster Sum of Squares) for a range of values of k (say 1-10). The formula for WCSS is

WCSS = Σᵢ₌₁ᵏ Σ₍ₓ ∈ Cᵢ₎ ‖x − cᵢ‖²

where Cᵢ is the i-th cluster and cᵢ is its centroid.

The steps for finding the best k using the Elbow method:

  • Execute k-means clustering on the given dataset for different values of k.
  • For each value of k, calculate the WCSS.
  • Plot the WCSS values against the corresponding values of k.
  • The sharp bend of the curve gives the best value of k, beyond which increasing k yields only a small decrease in WCSS.
Elbow method
Source: Wikipedia

The curve has a sharp bend like an elbow, hence the name Elbow method.
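WCSS can also be computed by hand; with scikit-learn it matches the fitted model’s `inertia_` attribute. A small sketch on made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up 2-D points for illustration
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# WCSS: sum of squared distances from each point to its cluster's centroid
wcss = sum(np.sum((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2)
           for i in range(2))
print(np.isclose(wcss, km.inertia_))  # True
```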

Python code

For demonstration, I’ll be using a Jupyter notebook and the Iris flower dataset for clustering.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("IRIS.csv")
data = df.iloc[:, [2, 3]].values  # petal length and petal width columns
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)  # WCSS for this value of k
plt.plot(range(1, 11), wcss)
plt.title("Elbow method")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.show()

Output

elbow

From the Elbow method, we can see that the best value of k is 3. Let’s use this value of k for clustering

model = KMeans(n_clusters=3, init="k-means++", random_state=0)  # k = 3 from the Elbow method
model_kmeans = model.fit_predict(data)
df['cluster'] = model_kmeans
df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]
plt.scatter(df1['petal_length'], df1['petal_width'], s=50, c="red", label="Iris-Setosa")
plt.scatter(df2['petal_length'], df2['petal_width'], s=50, c="blue", label="Iris-versicolor")
plt.scatter(df3['petal_length'], df3['petal_width'], s=50, c="green", label="Iris-virginica")
plt.title("Iris Flower Clustering")
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend()
plt.show()

Output

cluster

Here we can see the result of k-means clustering on the Iris flower dataset.

Conclusion

Clustering is an unsupervised task in machine learning. K-means clustering is a simple but powerful centroid-based clustering method. We need to define the value of k before clustering. Among the various approaches, the Elbow method is easy to implement: it calculates the WCSS for each value of k to find a suitable one. The selection of k is a crucial step in k-means clustering.

For more information about k-means clustering, follow this link

Happy Learning 🙂
