K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression. It assigns a new data point to a category based on how close, or how similar, it is to the points in the training data. Here, ‘K’ is the number of neighbors considered when classifying the new point. KNN is called a lazy learning algorithm because it does not build a model from the training data. Instead, it simply stores the data, and when a new point arrives it compares that point against the stored data at prediction time. The algorithm works in the following steps:
- Choose the suitable value of k (number of neighbors)
- Calculate the distance between the new point and each point in the training data, and pick the k closest points
- Count the number of neighbors in each category
- Assign the new data point to the category that has the most neighbors
- The model is ready
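The steps above can be sketched from scratch in a few lines of Python. This is a minimal illustration rather than scikit-learn's implementation; the function name `knn_predict` and the toy data are assumptions made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # Count the labels among the k neighbors and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters labelled "red" and "green"
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_train = ["red", "red", "red", "green", "green", "green"]

print(knn_predict(X_train, y_train, [2.0, 2.0], k=3))  # red
print(knn_predict(X_train, y_train, [6.0, 6.5], k=3))  # green
```

Note that there is no training step here: all the work happens when a new point is classified, which is what makes KNN a lazy learner.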
How does KNN work?
- Suppose that we have two categories in the input dataset. The diagram shown below shows the input data having two categories, one with red color and another with green color. We will classify the data in white color using KNN.
- The next step is to choose the number of neighbors, i.e., the value of k. Let’s take k to be 5.
- Now, the third step is to calculate the distance between the new point and the other points. Here are some of the methods used to calculate distance:
Euclidean distance: the straight-line distance between two points. For two points (x1, y1) and (x2, y2), it is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
Manhattan distance: the distance between two points measured along axes at right angles. For the same two points (x1, y1) and (x2, y2), it is $d = |x_2 - x_1| + |y_2 - y_1|$.
- After calculating the distance between the new point and the other points, we get the nearest neighbors, i.e., the 5 points closest to the new point.
- The next step is to count the number of neighbors in each category. As we can see, there are three (3) points in the red category and two (2) points in the green category.
So, the new data point belongs to the red category.
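The two distance measures described in the walkthrough, straight-line (Euclidean) and along-the-axes (Manhattan), can be written as small Python functions. The function names and the sample points below are assumptions chosen for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between points p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Distance between p and q measured along axes at right angles."""
    return sum(abs(a - b) for a, b in zip(p, q))


p1, p2 = (1, 2), (4, 6)
print(euclidean_distance(p1, p2))  # 5.0
print(manhattan_distance(p1, p2))  # 7
```

In scikit-learn's `KNeighborsClassifier`, `metric='minkowski'` with `p=2` corresponds to the Euclidean distance, and `p=1` corresponds to the Manhattan distance.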
Python code for implementation of KNN
For demonstration, we’ll use Jupyter Notebook and the Iris flower classification dataset to implement KNN.
# importing required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# read the dataset into a dataframe
df = pd.read_csv("IRIS.csv")

# input variables are all columns except the last; the target is the last column
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

# train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# standardize the data: fit the scaler on the training set only
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# create the model with an initial value of k (Minkowski with p=2 is Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model = knn.fit(X_train_scaled, y_train)

# make predictions
pred = model.predict(X_test_scaled)

# check the accuracy of the model
print(classification_report(y_test, pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         9
Iris-versicolor       0.90      0.82      0.86        11
 Iris-virginica       0.82      0.90      0.86        10

       accuracy                           0.90        30
      macro avg       0.91      0.91      0.90        30
   weighted avg       0.90      0.90      0.90        30
The KNN model performs well, with an accuracy of 90% on the test set.
How to select the best value of K
To select the best value of K, we can compute the error for different values of k and pick the value at which the error is lowest.
import matplotlib.pyplot as plt

# misclassification rate on the test set for each value of k
error = []
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    model = knn.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    error.append(np.mean(pred != y_test))

# plotting graph of error vs value of k
plt.plot(range(1, 50), error)
plt.xlabel("K")
plt.ylabel("Error")
plt.title("Best estimation of k")
plt.show()
As we can see, the error is lowest when k = 25, so the best value of k is 25. Using k = 25, we can rebuild the model as follows:
knn = KNeighborsClassifier(n_neighbors=25)
model = knn.fit(X_train_scaled, y_train)

# making prediction
pred = model.predict(X_test_scaled)

# checking accuracy of model
print(classification_report(y_test, pred))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00         9

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30
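After substituting the best value of k (25), the model accuracy has increased to 100%. Note that the class supports differ between the two reports (9/11/10 versus 12/9/9), which means the two runs used different random train/test splits, so part of the improvement may come from the split rather than from k. Passing a fixed random_state to train_test_split makes the comparison reproducible. Overall, KNN is useful for classification problems in machine learning.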
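The search above scores each candidate k on the test set. A common alternative is to choose k by cross-validation, which keeps the test set for the final evaluation only. The sketch below is one way to do that, not the method used in this article; it assumes scikit-learn's bundled Iris data (`load_iris`) as a stand-in for IRIS.csv and uses 5 folds, both choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for IRIS.csv
X, y = load_iris(return_X_y=True)

k_values = list(range(1, 50))
cv_error = []
for k in k_values:
    # Scaling lives inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    cv_error.append(1 - scores.mean())

# Pick the k with the lowest average cross-validated error
best_k = k_values[int(np.argmin(cv_error))]
print("k with the lowest cross-validated error:", best_k)
```

Because each error value is averaged over several folds, the chosen k depends less on one particular random split.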
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression. It assigns a new data point to the category whose points it is most similar to. To determine the best k for KNN, we calculate the error for different values of k and choose the value with the lowest error. Because KNN is a distance-based algorithm, the data must be standardized before training the model.
Happy Learning 🙂