How to Handle Null Values in Pandas

Introduction

Null values in a dataset are empty fields, which pandas represents as NaN (Not a Number). A null value does not mean zero; it means the field is empty. Datasets used to prepare machine learning models often contain some null values. They can appear because of human error during data entry or for many other reasons, and they must either be filled with a meaningful value or dropped from the dataset.
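To make the difference between a null value and zero concrete, here is a minimal sketch. The variable names are only for illustration; np.nan and pd.isna() are the standard NumPy and pandas names.

```python
import numpy as np
import pandas as pd

# NaN is a special floating-point value, not zero and not an empty string.
value = np.nan

is_zero = (value == 0)            # False: NaN is not equal to zero
equals_itself = (value == value)  # False: NaN is not even equal to itself
is_missing = pd.isna(value)       # True: the reliable way to test for a null

print(is_zero, equals_itself, is_missing)
```

Because NaN never compares equal to anything, use pd.isna() or isnull() rather than == to look for null values.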

In this tutorial, we’ll learn how to fill or drop those null values so that a machine learning model can perform well.

For demonstration, I will be using a Jupyter notebook. We are going to create a dataset with some null values in both a categorical column and a numerical column.

Here are some of the ways to handle null values in a dataset:

1. Dropping null values

A pandas DataFrame has a dropna() function that drops null values from the dataset. This method should only be used when the dataset is large and the number of null values is small.

import pandas as pd
import numpy as np
info = {
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan, 'female', 'male', 'female']
}
df = pd.DataFrame(info)
print(df)

Output

    Age  Gender
0   NaN    male
1  23.0     NaN
2  45.0  female
3   NaN    male
4  34.0     NaN
5   NaN     NaN
6   NaN  female
7  45.0    male
8  54.0  female

This dataset has some null values, represented by NaN. Before treating them, let’s see how to find out how many null values are present in the dataset.

df.isnull().sum()

Output

Age       4
Gender    3
dtype: int64

Using the isnull() and sum() functions, we can find out how many null values are present in each column. There are 4 null values in the ‘Age’ column and 3 null values in the ‘Gender’ column.
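Counts alone do not say how much of the data is missing. As a small sketch on the same dataset, the share of nulls per column and the number of affected rows can be computed as well. The names null_share and rows_with_null are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

null_counts = df.isnull().sum()                 # nulls per column
null_share = df.isnull().mean()                 # fraction of nulls per column
rows_with_null = df.isnull().any(axis=1).sum()  # rows with at least one null

print(null_counts)
print(null_share)
print(rows_with_null)
```

Here 6 of the 9 rows contain at least one null value, which is worth knowing before deciding whether to drop rows.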

Let’s take a look at how dropna() is used to drop null values from the dataset:

df2 = df.dropna()
print(df2)

Output

    Age  Gender
2  45.0  female
7  45.0    male
8  54.0  female

Using dropna(), we can drop every row that has at least one null value. The DataFrame above is what remains after dropping those rows. Only 3 of the 9 rows survive, which is why dropping is risky on a small dataset.
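dropna() does not have to be all or nothing. The sketch below uses its subset and how parameters to drop fewer rows. Which columns are allowed to stay null is an assumption made for this example, not a rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

# Drop a row only when 'Age' is null; nulls in 'Gender' are tolerated.
age_known = df.dropna(subset=['Age'])

# Drop a row only when every column in it is null.
not_all_null = df.dropna(how='all')

print(len(df), len(age_known), len(not_all_null))  # 9 5 8
```

With how='all', only row 5 is removed, because it is the only row where both columns are null.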

2. Using fillna() function

Using the fillna() function, we can fill the null values with any value we choose.

df['Age'] = df['Age'].fillna(0)
df['Gender'] = df['Gender'].fillna("Not Specified")
print(df)

Output

    Age         Gender
0   0.0           male
1  23.0  Not Specified
2  45.0         female
3   0.0           male
4  34.0  Not Specified
5   0.0  Not Specified
6   0.0         female
7  45.0           male
8  54.0         female

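Null values in the ‘Age’ column are filled with zero (which is not good practice, because zero is not a plausible age and it distorts statistics such as the mean), and those in the ‘Gender’ column are filled with ‘Not Specified’.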
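The same result can be obtained in one call by passing fillna() a dictionary that maps each column to its fill value. This is a sketch on the same dataset. It returns a new DataFrame and leaves the original untouched, so the later examples can still start from the data with nulls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

# One fill value per column; columns not listed would be left as they are.
filled = df.fillna({'Age': 0, 'Gender': 'Not Specified'})

print(filled)
print(df.isnull().sum())  # the original still contains its null values
```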

Rather than filling the null values with an arbitrary value, we can use the ‘ffill’ and ‘bfill’ methods. The former means forward fill, which fills each null value with the previous valid value, while the latter means backward fill, which fills each null value with the next valid value in the dataset.

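df = pd.DataFrame(info)  # start again from the original data
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
df['Age'] = df['Age'].ffill()
df['Gender'] = df['Gender'].ffill()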
print(df)

Output

    Age  Gender
0   NaN    male
1  23.0    male
2  45.0  female
3  45.0    male
4  34.0    male
5  34.0    male
6  34.0  female
7  45.0    male
8  54.0  female

The ‘ffill’ method fills each null value in a column with the previous valid value. We can see that the first null value in the ‘Age’ column is not filled because there is no previous value to fill it with. This is one of the disadvantages of the ‘ffill’ method. The next null value is filled with ‘45.0’ because the previous value is ‘45.0’. The same process is applied to the ‘Gender’ column.
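One common way around the unfilled first value is to chain a backward fill after the forward fill. The sketch below does this for the ‘Age’ values only. Whether copying a neighbouring value is sensible depends on the data, so treat it as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

age = pd.Series([np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54], name='Age')

forward_only = age.ffill()   # the leading NaN has no previous value, so it stays
both = age.ffill().bfill()   # bfill() then fills what ffill() could not reach

print(forward_only.tolist())
print(both.tolist())
```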

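df = pd.DataFrame(info)  # start again from the original data
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas
df['Age'] = df['Age'].bfill()
df['Gender'] = df['Gender'].bfill()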
print(df)

Output

    Age  Gender
0  23.0    male
1  23.0  female
2  45.0  female
3  34.0    male
4  34.0  female
5  45.0  female
6  45.0  female
7  45.0    male
8  54.0  female

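The ‘bfill’ method fills each null value with the next valid value. We can see that, unlike the ‘ffill’ method, it filled the first null value in the ‘Age’ column with the next valid value, which is 23.0. It has the mirror-image weakness: a null value at the end of a column would stay unfilled because there is no next value.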
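Both ffill() and bfill() accept a limit argument that caps how many consecutive null values are filled from one valid value. A short sketch on the ‘Age’ values, where the limit of 1 is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

age = pd.Series([np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54], name='Age')

# Fill at most one null per gap, working backward from the next valid value.
limited = age.bfill(limit=1)

print(limited.tolist())
```

Rows 5 and 6 form a gap of two nulls, so only row 6 (the one next to the valid value 45.0) is filled and row 5 stays NaN.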

3. Using interpolation

Interpolation is another method used to fill null values: it estimates each missing value from the valid values around it. A DataFrame has an interpolate() function for this.

df = pd.DataFrame(info)  # start again from the original data
df.interpolate(method='linear', limit_direction='forward')

Output

         Age  Gender
0        NaN    male
1  23.000000     NaN
2  45.000000  female
3  39.500000    male
4  34.000000     NaN
5  37.666667     NaN
6  41.333333  female
7  45.000000    male
8  54.000000  female

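Interpolation works only on numerical data, so the null values in the ‘Gender’ column are left unfilled. With limit_direction = 'forward', the first ‘Age’ value also stays NaN because there is no earlier value to interpolate from.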

df2 = df.interpolate(method = 'linear', limit_direction = 'backward')
print(df2)

Output

         Age  Gender
0  23.000000    male
1  23.000000     NaN
2  45.000000  female
3  39.500000    male
4  34.000000     NaN
5  37.666667     NaN
6  41.333333  female
7  45.000000    male
8  54.000000  female

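With limit_direction = 'backward', the leading null value in ‘Age’ is filled with the next valid value, 23.0. This is how we can use the interpolation method to fill the null values in a dataset.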
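Since interpolation only applies to numbers, it can also be run on the numerical column alone. The sketch below does that with limit_direction='both', so values are filled in either direction. As far as I know, newer pandas versions emit a deprecation warning when interpolate() is called on object columns such as ‘Gender’, which is another reason to select the column explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [np.nan, 23, 45, np.nan, 34, np.nan, np.nan, 45, 54],
    'Gender': ['male', np.nan, 'female', 'male', np.nan, np.nan,
               'female', 'male', 'female'],
})

filled = df.copy()
# Interpolate only the numeric column; 'Gender' is deliberately left alone.
filled['Age'] = filled['Age'].interpolate(method='linear', limit_direction='both')

print(filled)
```

The ‘Gender’ column still has its 3 null values afterwards, so it needs one of the other methods.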

4. Using mean, median and mode

Let’s see how the mean, median and mode are used to fill the null values in a dataset. The mean and median are used to fill the null values of numerical data, and the mode is used to fill the null values of categorical data.

df = pd.DataFrame(info)  # start again from the original data
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

Output

    Age  Gender
0  40.2    male
1  23.0  female
2  45.0  female
3  40.2    male
4  34.0  female
5  40.2  female
6  40.2  female
7  45.0    male
8  54.0  female

Categorical values are filled with the mode of the same column, i.e. the ‘Gender’ column, and the ‘Age’ column is filled with its own mean. The mean of the five known ages is 40.2. In ‘Gender’, ‘male’ and ‘female’ each appear three times, so mode() returns both values in sorted order and mode()[0] picks ‘female’. This is one of the most commonly used methods for filling null values in categorical and numerical columns.

df = pd.DataFrame(info)  # start again from the original data
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)

Output

    Age  Gender
0  45.0    male
1  23.0  female
2  45.0  female
3  45.0    male
4  34.0  female
5  45.0  female
6  45.0  female
7  45.0    male
8  54.0  female

Mean or median: which to choose?

Numerical values can also be filled using the median, as shown above. Filling null values with the mean can distort the dataset when outliers are present, because outliers pull the mean (and the standard deviation) toward themselves. The median is much less sensitive to outliers, so filling with the median can be a very effective alternative.
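A small sketch shows the effect of an outlier. The value 540 is a made-up data-entry error (54 typed with an extra zero), used only to illustrate the point.

```python
import pandas as pd

ages = pd.Series([23, 45, 34, 45, 54])           # the known ages from the dataset
with_outlier = pd.Series([23, 45, 34, 45, 540])  # hypothetical typo: 540 instead of 54

print(ages.mean(), ages.median())                  # 40.2 45.0
print(with_outlier.mean(), with_outlier.median())  # 137.4 45.0
```

A single bad value moves the mean from 40.2 to 137.4, while the median stays at 45.0, so the median would give a far more sensible fill value here.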

Conclusion

The raw dataset available for preparing a machine learning model may have some null values in it. Most machine learning models need those null values to be filled or removed, and filling them with suitable, meaningful data helps the model perform well. There are several ways of handling null values: dropna(), fillna(), interpolation, and filling with the mean, median or mode. Dropping null values is not useful in a small dataset, but it can be useful if the dataset is large and has few null values in it.

Happy Learning:-)
