KNN(K-Nearest Neighbor) In Machine Learning

KNN(K-Nearest Neighbor) In Machine Learning

Introduction

K-Nearest Neighbor(KNN) is a supervised algorithm in machine learning that is used for classification and regression analysis. This algorithm assigns the new data based on how close or how similar the data is with the points on training data. Here, ‘K’ represents the number of neighbors that are considered to classify the new data point. KNN is called a lazy learning algorithm because it uses all the data during training for the classification of a new point. In other words, it doesn’t learn from training data rather it stores data and when new data is introduced it classifies that new point in the course of training.

KNN algorithm

  • Choose the suitable value of k (number of neighbors)
  • Calculate the distance between the new point and the k number of the closest point
  • Count the number of neighbors in each category
  • Assign the new data to that category that has the maximum number of neighbors
  • The model is ready

How KNN works?

  • Suppose that we have two categories in the input dataset. The diagram shown below shows the input data having two categories, one with red color and another with green color. We will classify the data in white color using KNN.

input features for knn

  • The next step is to choose the number of neighbors i.e the value of k. Let’s take the value of k to be 5
  • Now, the third step is to calculate the distance between a new point and other points. Here are some of the methods that are used for the calculation of distance
    • Euclidean distance

      Euclidean distance

      It is a straight line distance between two points. Let (x1, y1) and (x2, y2) be the two points. Above formula will calculate the distance between these two points.

    • Manhattan distance

      Manhattan distance

      It is the distance between two points along axes at a right angle. Let (x1, y1) and (x2, y2) be the two points. Above formula will calculate the distance between these two points.

  • After calculating the distance between the new point and other points, we’ve got the nearest neighbors i.e 5 nearest points with reference to the new point.Euclidean distance for knn
  • The next step is to count the number of neighbors in each category. As we can see that there are three(3) points in the red category and two(2) points in the green category.  Output of knn

So, new data point belongs to red category.

Python code for implementation of KNN

For demonstration, we’ll be using Jupyter Notebook and we’ll be using Iris flower classification dataset for implementation of KNN.

#importing required models
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

#read dataframe
df = pd.read_csv("IRIS.csv")

#target and input variable selection
X, y = df.iloc[:,:-1].values, df.iloc[:, -1].values

#do train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

#Standardizing the data
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

#creating model with random value of K
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model= knn.fit(X_train_scaled, y_train)

#making prediction
pred = model.predict(X_test_scaled)

#checking accuracy of model
print(classification_report(y_test, pred))

Output

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         9
Iris-versicolor       0.90      0.82      0.86        11
 Iris-virginica       0.82      0.90      0.86        10

       accuracy                           0.90        30
      macro avg       0.91      0.91      0.90        30
   weighted avg       0.90      0.90      0.90        30

The KNN model performance is pretty good and the accuracy is 90%.

How to select the best value of K

We can go for the error value generated by using different values of k to see at which particular value the error is minimum to select the best value of K.

import matplotlib.pyplot as plt
error = []
for k in range(1,50):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    model= knn.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    error.append(np.mean(pred != y_test))
                         
#plotting graph of error vs value of k                         
plt.plot(range(1, 50), error)
plt.xlabel("K")
plt.ylabel("Error")
plt.title("Best estimation of k ")
plt.show()

Output

estimation of best k in knn

As we can see that when k = 25 the error is minimum. So, the best value of k is 25. Using the value of k = 25, we can rebuild the model as

knn = KNeighborsClassifier(n_neighbors=25)
model= knn.fit(X_train_scaled, y_train)

#making prediction
pred = model.predict(X_test_scaled)

#checking accuracy of model
print(classification_report(y_test, pred))

Output

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00         9

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30

After substituting the best value of k(25) model accuracy has increased to 100%. So, KNN is useful for classification problem in machine learning.

Conclusion

K-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm. KNN is applicable for both classification and regression purposes. This algorithm assigns the new data by analyzing how much the data is similar to the specific category. To determine the best k for KNN, it is important to calculate the error associated with different values of k. After calculating the value of error associated with different k, we will choose that value of with low error. Data must be standardize before training the model as KNN is a distance based algorithm for classification.

For more information follow this link

Happy Learning 🙂

Leave a Reply