Introduction to clustering
Clustering is an unsupervised task in machine learning, which means the data fed to the algorithm has no labels. Clustering is the process of grouping the data into different groups that share similar characteristics or properties. A cluster is a group of data points in which each element is similar to the other elements of the same cluster but different from the elements of other clusters. The k-means algorithm is a popular algorithm for clustering data points.
Examples of clustering
Some examples of clustering are as follows:
- Customer segmentation having a similar purchase history
- Image segmentation
- Vegetables arranged in the shop
- Items arranged in a shopping mall
- Spam filtering and so on
K-means clustering is a centroid-based clustering technique that groups similar data points together. Here, k represents the predefined number of clusters to be determined from the unlabeled data. If k = 3, the data will be grouped into three clusters, and if k = 5, the data will be grouped into five clusters. The algorithm proceeds as follows:
- Define the value of k, i.e., the number of clusters to be determined
- Randomly select k data points as the initial centroids
- Measure the distance between each point and every centroid
- Assign each point to the cluster of its nearest centroid
- Calculate the mean of each cluster and update its centroid
- Repeat steps 3-5 until the centroids no longer change
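The steps above can be sketched in plain NumPy. This is an illustrative implementation, not an optimized one: the function name `kmeans` is chosen here for the example, and it assumes every cluster keeps at least one point during the iterations.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Illustrative k-means: follows the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k of the data points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3-4: distance from every point to every centroid,
        # then assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points
        # (assumes no cluster is empty)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that k-means converges to a local optimum, so the result can depend on which points are picked as the initial centroids; libraries typically rerun the algorithm several times with different initializations.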
Working mechanism of k-means clustering
Let’s understand how k-means clustering works step by step with the necessary figures. The figure below shows a scatterplot of the input data that needs to be clustered.
Let’s suppose we need to group these data points into two clusters. Hence, we take k = 2, and after randomly selecting the two centroids, the data looks like the figure below.
The yellow stars represent the two centroids. So far, we have plotted the data in a scatterplot and assigned two centroids randomly.
Now, the next task is to measure the distance between each data point and the two centroids, which are separated by the median line shown in red. The distance used is the Euclidean distance, given for two points p and q by

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)²)
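In code, the Euclidean distance can be computed with a small helper like the one below (`euclidean` is a name chosen here for illustration; it works for points of any equal dimension):

```python
import math

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0 (the classic 3-4-5 triangle)
```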
After computing the distance between each data point and the two centroids, the predetermined number of clusters is formed. The figure below shows the data grouped into two clusters.
After the data are clustered, the mean of each cluster is calculated and its centroid is shifted to a new position, so that the sum of the distances between the data points and their centroid is minimized.
After shifting the centroids to their new positions, we again calculate the distance between the points and the centroids and assign each data point to the nearest cluster. In the figure above, we can see that two data points that previously belonged to the purple cluster are now part of the green cluster because of the change in the centroids.
The algorithm then recalculates the mean of each of the two clusters and checks whether the centroids have shifted. If they have, the algorithm repeats all the steps above until the centroids no longer change.
As we can see, each centroid has moved to a new position. We again calculate the distance between the data points and the centroids and assign each point to the nearest cluster. After this assignment, no data point has changed its cluster.
The figure above shows that there is no shift in the centroid positions with respect to their previous positions. The figure below shows the final result of the clustering.
Hence, this is the final result of the k-means clustering algorithm.
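The whole procedure illustrated above is what library implementations such as scikit-learn's `KMeans` run for us. Here is a minimal sketch on some made-up 2-D points (the data values are hypothetical, chosen only to form three visible groups):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: three loose groups (illustrative values)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0],
                   [1.0, 8.0], [1.5, 9.0], [0.5, 8.5]])

# k = 3 asks the algorithm to find exactly three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # one cluster id (0, 1, or 2) per point
```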
Selection of best k for clustering
The performance of k-means clustering depends on the value of k, i.e., the number of clusters, so we should select it carefully. There are various ways of selecting the value of k; in this tutorial, we will use the Elbow method to determine the best k for our model.
This method computes the value of WCSS (Within-Cluster Sum of Squares) for a range of values of k (say 1-10). The formula for WCSS is

WCSS = Σᵢ Σ₍ₓ ∈ Cᵢ₎ ‖x − cᵢ‖²

where cᵢ is the centroid of cluster Cᵢ.
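The WCSS can be computed directly from the points, their cluster labels, and the centroids. Below is a small sketch (`wcss` is a name chosen here for illustration); note that scikit-learn's `KMeans` exposes this same quantity as the `inertia_` attribute:

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-Cluster Sum of Squares: total squared distance of every
    point to the centroid of the cluster it is assigned to."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))
```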
The steps for finding the best k using the Elbow method are:
- Execute k-means clustering on the given dataset for different values of k.
- For each value of k, calculate the WCSS.
- Plot the WCSS values against the corresponding values of k.
- The sharp bend of the curve gives the best value of k, beyond which increasing k reduces the WCSS only slightly.
The curve has a sharp bend like an elbow, hence the name Elbow method.
For this demonstration, I’ll be using a Jupyter notebook and the Iris flower dataset for clustering.
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("IRIS.csv")
data = df.iloc[:, [2, 3]].values  # petal length and petal width columns

param = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(data)
    param.append(kmeans.inertia_)  # inertia_ is the WCSS

plt.plot(range(1, 11), param)
plt.title("Elbow method")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.show()
```
From the Elbow method, we can see that the best value of k is 3. Let’s use this value of k for clustering.
```python
model = KMeans(n_clusters=3, init="k-means++", random_state=0)
model_kmeans = model.fit_predict(data)
df['cluster'] = model_kmeans

df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]

plt.scatter(df1['petal_length'], df1['petal_width'], s=50, c="red", label="Iris-Setosa")
plt.scatter(df2['petal_length'], df2['petal_width'], s=50, c="blue", label="Iris-versicolor")
plt.scatter(df3['petal_length'], df3['petal_width'], s=50, c="green", label="Iris-virginica")
plt.title("Iris Flower Clustering")
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.legend()
plt.show()
```
Here we can see the result of k-means clustering on the Iris flower dataset.
Clustering is an unsupervised task in machine learning. K-means clustering is a simple but powerful centroid-based clustering method. We need to define the value of k before clustering. Among the available approaches, the Elbow method is easy to implement: it calculates the WCSS for each value of k to find the most suitable one. Selecting the value of k is a crucial step in k-means clustering.
For more information about k-means clustering, follow this link.
Happy Learning 🙂