How To Handle Imbalanced Datasets In Machine Learning

Introduction

An imbalanced dataset is a classification dataset whose samples are distributed unequally among its classes. Most machine learning algorithms work best with balanced datasets, that is, datasets in which each class has roughly the same number of samples. For example, a dataset with 99% of its samples in the majority class and only 1% in the minority class is imbalanced, while a dataset with about 50% of its samples in each class is balanced. We need to handle imbalanced datasets to get better performance from our model.
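To make the definition concrete, here is a minimal sketch that measures how imbalanced a list of labels is. The helper names class_distribution and imbalance_ratio are hypothetical, chosen only for this illustration, and the toy labels mirror the 99% / 1% example.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: fraction of samples} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def imbalance_ratio(labels):
    """Majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy labels: 990 samples of class 0 and 10 samples of class 1
labels = [0] * 990 + [1] * 10
dist = class_distribution(labels)
ratio = imbalance_ratio(labels)
print(dist, ratio)
```

A ratio close to 1 indicates a balanced dataset; the larger the ratio, the stronger the imbalance.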

Why do we need balanced datasets?

If a machine learning algorithm is trained on an imbalanced dataset, the model tends to become biased towards the majority class. Because most of the training data belongs to the majority class, the model can score well simply by predicting that class most of the time.

For example

Let’s say that an imbalanced dataset has 95% of its data in one class and only 5% in the other. A model trained on this dataset can reach 95% accuracy simply by always predicting the majority class. But is that the result we are seeking? Not at all. Whatever data you pass to such a model, it will predict the majority class. The result is a dumb model that reports 95% accuracy yet misclassifies every minority class sample. Hence we need to handle imbalanced datasets.
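The 95% figure can be checked with a small sketch. The labels below are made up for illustration, and the "model" is simply a list of majority class predictions, not a trained classifier.

```python
# 95 majority class samples (0) and 5 minority class samples (1)
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: share of actual minority samples that were found
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
minority_recall = true_pos / actual_pos

print("Accuracy:", accuracy)
print("Minority recall:", minority_recall)
```

Accuracy looks high while recall on the minority class is zero, which is why accuracy alone is a misleading metric on imbalanced data.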

Note: For demonstration, I will use Jupyter Notebook and the Credit Card Fraud Detection dataset from Kaggle.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv("creditcard.csv")
print("Shape of dataset: ", df.shape)

Output

Shape of dataset:  (284807, 31)

The shape of the credit card fraud detection dataset is 284807 x 31: there are 284807 rows and 31 columns. Let’s see the columns of this dataframe.

df.columns.values

Output

array(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9',
       'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18',
       'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27',
       'V28', 'Amount', 'Class'], dtype=object)

Here Class is the output variable and the others are input variables. Let’s see the amount of data that belongs to each class.

sns.countplot(x="Class", data=df)

Output

[Figure: count plot of the Class column in the imbalanced dataset]

We can see that the data is distributed unequally between class 0 and class 1. Let’s see the exact amount of data associated with each class.

df['Class'].value_counts()

Output

0    284315
1       492
Name: Class, dtype: int64

There are 284315 samples that belong to class 0 and 492 samples that belong to class 1. This is a perfect example of an imbalanced dataset.
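To put these counts in perspective, the sketch below computes the share of class 1 and the accuracy a model would get by always predicting class 0. The counts are copied from the output above; treating class 1 as the fraud class is an assumption based on the dataset's topic.

```python
# Class counts copied from the value_counts() output above
counts = {0: 284315, 1: 492}
total = sum(counts.values())

# Share of class 1 (assumed here to be the fraud class), in percent
minority_pct = 100 * counts[1] / total

# Accuracy of a model that always predicts class 0, in percent
baseline_accuracy_pct = 100 * counts[0] / total

print(f"Class 1 share: {minority_pct:.3f}%")
print(f"Always-predict-0 accuracy: {baseline_accuracy_pct:.2f}%")
```

Any model trained on this data has to be judged against that baseline rather than against 50%.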

Techniques for handling imbalanced datasets

Let’s discuss the techniques that are available for handling imbalanced datasets. We will look at the under sampling technique, the over sampling technique, and a combination of both.

Under sampling technique

In the under sampling technique, the amount of data in the majority class is reduced until it equals the amount of data in the minority class. The main disadvantage of this technique is loss of information: the discarded majority class samples may contain patterns the model needs, so performance can suffer.

For example
Let’s say the dataset has 900 samples that belong to class A and 100 samples that belong to class B. With under sampling, class A is reduced to 100 samples to make it equal to class B.
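The 900 / 100 example can be sketched with plain NumPy. This is random under sampling, the simplest variant, and not the NearMiss method; the function name random_undersample and the toy feature matrix are hypothetical.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # For each class, pick n_keep row indices without replacement
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ])
    rng.shuffle(keep_idx)
    return X[keep_idx], y[keep_idx]

# 900 samples of class "A" and 100 of class "B", two made-up features each
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.array(["A"] * 900 + ["B"] * 100)

X_res, y_res = random_undersample(X, y)
n_a = int((y_res == "A").sum())
n_b = int((y_res == "B").sum())
print("After under sampling:", n_a, n_b)
```

Note that 800 of the 900 class A rows are thrown away, which is the information loss described above.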
For under sampling we need the imbalanced-learn library (imported as imblearn), so make sure to install it.

$ pip install imbalanced-learn

Let’s implement under sampling using NearMiss

from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import NearMiss

#dependent and independent features
X = df.drop(columns = ["Class"])
y = df['Class']

#split into train and test
X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size = 0.2)

#undersampling with NearMiss
us = NearMiss()
X_train_resample, y_train_resample = us.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Under Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227464, 1: 379})
After Under Sampling:  Counter({0: 379, 1: 379})

We can see that before under sampling there are 227464 samples in class 0 and 379 in class 1. After under sampling, both classes have 379 samples.
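NearMiss does not drop majority class samples at random. As far as I understand its default variant (version 1), it keeps the majority samples whose average distance to their few closest minority samples is smallest. The following is a simplified sketch of that idea on toy data, not the library's implementation; nearmiss1_sketch is a hypothetical name, and the dense distance matrix is only practical for small inputs.

```python
import numpy as np

def nearmiss1_sketch(X, y, majority, minority, k=3):
    """Keep the majority rows with the smallest mean distance to their k closest minority rows."""
    X_maj, X_min = X[y == majority], X[y == minority]
    # Pairwise Euclidean distances, shape (n_majority, n_minority)
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Mean distance from each majority row to its k nearest minority rows
    mean_k = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Keep as many majority rows as there are minority rows
    keep = np.argsort(mean_k)[: len(X_min)]
    X_out = np.vstack([X_maj[keep], X_min])
    y_out = np.concatenate([np.full(len(keep), majority), np.full(len(X_min), minority)])
    return X_out, y_out

# 1-D toy data: minority points sit near 0, majority points spread out to 9
X = np.array([[0.0], [0.1], [0.2], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])

X_res, y_res = nearmiss1_sketch(X, y, majority=0, minority=1)
kept_majority = sorted(X_res[y_res == 0].ravel().tolist())
print(kept_majority)
```

In this toy case the majority points at 1, 2 and 3 are kept because they lie closest to the minority cluster, while those at 7, 8 and 9 are discarded.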

Oversampling technique

In this technique, the data of the minority class is duplicated or synthetically generated until it equals the data of the majority class. Duplicated data adds no new information, while synthetically generated data can introduce noise into the dataset.


For example
Let’s say the dataset has 900 samples that belong to class A and 100 samples that belong to class B. With oversampling, class B is increased to 900 samples to make it equal to class A.
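Here is a minimal NumPy sketch of the same 900 / 100 example using duplication. The function name random_oversample and the toy data are hypothetical; the sketch only illustrates the idea of drawing existing rows again with replacement.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    idx_parts = []
    for cls, n in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from the existing rows of this class
        extra = cls_idx[rng.integers(0, n, size=n_target - n)]
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

# 900 samples of class "A" and 100 of class "B", two made-up features each
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.array(["A"] * 900 + ["B"] * 100)

X_res, y_res = random_oversample(X, y)
n_b = int((y_res == "B").sum())
n_unique_b = len(np.unique(X_res[y_res == "B"], axis=0))
print("Class B rows:", n_b, "distinct class B rows:", n_unique_b)
```

Class B grows to 900 rows, but the number of distinct rows stays at 100, which shows that duplication adds no new information.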

#oversampling with RandomOverSampler
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state = 42)
X_train_resample, y_train_resample = ros.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Over Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227464, 1: 379})
After Over Sampling:  Counter({0: 227464, 1: 227464})

We can see that oversampling has increased the minority class data by simply duplicating existing samples. This duplication doesn’t provide additional information; the dataset just contains redundant copies. If we want to generate new data rather than duplicate existing data, we can use SMOTE.

#oversampling with SMOTE
from imblearn.over_sampling import SMOTE
smot = SMOTE(random_state = 42)
X_train_resample, y_train_resample = smot.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Over Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227464, 1: 379})
After Over Sampling:  Counter({0: 227464, 1: 227464})

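RandomOverSampler duplicates existing minority class samples, whereas SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic samples by interpolating between a minority class sample and one of its nearest minority class neighbors.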
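The core idea behind SMOTE can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. This is a simplified illustration, not the imblearn implementation; smote_sketch and the toy points are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances among minority rows; exclude each row from its own neighbours
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]        # k nearest neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)       # starting row for each new sample
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                         # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Six made-up minority points inside the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])

synthetic = smote_sketch(X_min, n_new=20, k=3)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority points, the new data stays inside the region the minority class already occupies, but it can still blur the class boundary when the neighbors are noisy.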

Mixture of under sampling and oversampling

Since both over sampling and under sampling have their own disadvantages, as mentioned above, we can use a mixture of the two techniques. Let’s see how this is implemented.

#combined sampling with SMOTETomek
from imblearn.combine import SMOTETomek
combine = SMOTETomek(random_state = 42)
X_train_resample, y_train_resample = combine.fit_resample(X_train, y_train)

print("Original Data: ", Counter(y_train))
print("After Combined Sampling: ", Counter(y_train_resample))

Output

Original Data:  Counter({0: 227466, 1: 379})
After Combined Sampling:  Counter({0: 227466, 1: 227466})
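SMOTETomek first over samples with SMOTE and then cleans the result by removing Tomek links. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor, which usually means they sit right on the class boundary. The sketch below only finds such pairs on toy data; tomek_links is a hypothetical helper, not the library's code.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # a point is not its own neighbour
    nearest = dists.argmin(axis=1)       # nearest neighbour of every point
    return [
        (i, int(j)) for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]

# 1-D toy data: the points at 1.0 (class 0) and 1.1 (class 1) straddle the class boundary
X = np.array([[0.0], [0.4], [1.0], [1.1], [1.6], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

links = tomek_links(X, y)
print(links)
```

Removing the samples in such pairs leaves a cleaner gap between the classes, which is the under sampling half of the combined technique.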

Conclusion

Machine learning algorithms work best with balanced datasets. To reduce a model’s bias towards the majority class, we need to handle imbalanced datasets. In under sampling, we reduce the data of the majority class, which can cause loss of information and poor model performance. In over sampling, the data of the minority class is increased until it equals the data of the majority class, either by duplicating existing data or by generating new data; this can add redundancy or noise to the dataset and mislead the model. We can also combine over sampling and under sampling. Whichever technique you choose, handling imbalanced datasets is essential.

Happy Learning 🙂

 
