Introduction 

Datasets used in machine learning typically contain numerical as well as categorical features. Categorical data comes in three types: ordinal, nominal and boolean.

Ordinal data has an inherent ordering among its values, while nominal data has no such ordering. Boolean data takes only two values, labelled True or False.

Categorical features are usually stored as strings, which are easy for humans to read. A machine learning model, however, cannot interpret categorical data directly, so it must be translated into numerical data that the machine can understand.

There are many ways to convert categorical data into numerical data. In this tutorial, we'll discuss the three most commonly used methods:

  1. Label Encoding

  2. One Hot Encoding

  3. Dummy Variable Encoding

We are going to discuss each method in detail with examples. If you are new to machine learning, I'll try to keep the concepts clear and easy to understand. So, without further ado, let's dive into the topic.

I assume you already have Python installed on your system. We will be using the Python packages pandas and scikit-learn in the examples below, so make sure you have installed them with pip before running the examples:

$ pip install pandas
$ pip install scikit-learn
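
If you want to confirm that both packages are importable before continuing, a quick optional check like the following will print the installed versions:

# Optional sanity check: confirm pandas and scikit-learn are installed
# and print their versions.
import pandas as pd
import sklearn

print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)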

 

1. Label Encoding

Label encoding converts categorical data into numerical data that a computer can understand. In this method, each class is assigned an integer starting from zero (0). We'll build a dataframe to see how label encoding works:

import pandas as pd
info = {
    'Gender' : ['male', 'female', 'female', 'male', 'female', 'female'],
    'Position' : ['CEO', 'Cleaner', 'Employee', 'Cleaner', 'CEO', 'Cleaner']
}
df = pd.DataFrame(info)
print(df)

Output:

   Gender  Position
0    male       CEO
1  female   Cleaner
2  female  Employee
3    male   Cleaner
4  female       CEO
5  female   Cleaner

This is the dataframe we are going to work with. It contains Gender and Position (say, in a company) as features. Since these features are categorical, they need to be converted into numerical data.

We'll use scikit-learn's LabelEncoder to convert these features into numeric values:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit on the Gender column and transform it to integer labels in one step
gender_encoded = le.fit_transform(df['Gender'])

print(gender_encoded)

Output:

[1 0 0 1 0 0]

As we can see, female is encoded to zero (0) and male is encoded to one (1) (LabelEncoder assigns integers to the classes in alphabetical order). Now we add this encoded data to the original dataframe:

df['encoded_gender'] = gender_encoded
print(df)

Output:

   Gender  Position  encoded_gender
0    male       CEO               1
1  female   Cleaner               0
2  female  Employee               0
3    male   Cleaner               1
4  female       CEO               0
5  female   Cleaner               0

Before training a model, the original string columns should be dropped from the dataframe (shown in the sketch after the mapping below). Next, we apply label encoding to the 'Position' column to convert it into numerical data:

encoded_position = le.fit_transform(df['Position'])
df['encoded_position'] = encoded_position
print(df)

Output:

   Gender  Position  encoded_gender  encoded_position
0    male       CEO               1                 0
1  female   Cleaner               0                 1
2  female  Employee               0                 2
3    male   Cleaner               1                 1
4  female       CEO               0                 0
5  female   Cleaner               0                 1

If you compare the Position and encoded_position columns, you can see that CEO is encoded to 0, Cleaner to 1 and Employee to 2, i.e.

CEO => 0

Cleaner => 1

Employee => 2
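
If you want to double-check this mapping, the fitted encoder exposes it through its classes_ attribute (the index of each class is its encoded value), and once the encoded columns are in place the original string columns can be dropped. A minimal sketch:

# The encoder was last fitted on 'Position', so classes_ lists its categories
# in encoded order: index 0 -> 'CEO', 1 -> 'Cleaner', 2 -> 'Employee'.
print(le.classes_)

# Drop the original string columns now that the encoded versions exist.
df_numeric = df.drop(columns=['Gender', 'Position'])
print(df_numeric)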

One big disadvantage of the label encoder is that it encodes categories as integers starting from 0, which can introduce a false sense of priority. For instance, female is encoded to 0 and male to 1, which a model could interpret as male being ranked higher than female, a meaningless ordering. For this reason, label encoding is best suited to ordinal data (a finite set of values with a rank ordering between them).

 

2. One Hot Encoding

For datasets with nominal categories (no rank ordering between values), integer encoding may not be sufficient. Applying integer encoding to nominal data can mislead the model into assuming an ordering, which can result in poor performance.

One hot encoding converts each integer-encoded value into a group of binary variables, where each bit represents one category. Since a value cannot belong to multiple categories at once, exactly one bit in the group is "on", which is why it is called one-hot encoding.

In this example, the categorical variable is first converted to numeric values with the label encoder, and one hot encoding is then applied to those numeric values.

We are going to use previously encoded data for one hot encoding as:

from sklearn.preprocessing import OneHotEncoder

gender_encoded = le.fit_transform(df['Gender'])
# OneHotEncoder expects a 2-D array, so reshape the 1-D labels into a column
gender_encoded = gender_encoded.reshape(-1, 1)
# sparse_output=False returns a dense array (use sparse=False on scikit-learn < 1.2)
one = OneHotEncoder(sparse_output=False)

print(one.fit_transform(gender_encoded))

Output:

[[0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]

Since OneHotEncoder expects a 2-D (column-shaped) input, we reshape the 1-D array returned by the label encoder. Here, male is encoded to [0 1] and female to [1 0].
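
As a side note, recent versions of scikit-learn (0.20 and later) let OneHotEncoder work on string columns directly, so the label-encoding step above is optional. A minimal sketch, assuming such a version is installed:

# Assumes scikit-learn >= 0.20, where OneHotEncoder accepts string columns.
ohe = OneHotEncoder(sparse_output=False)   # use sparse=False on versions < 1.2
gender_onehot = ohe.fit_transform(df[['Gender']])   # note the 2-D column selection

print(ohe.categories_)   # the learned categories, e.g. ['female', 'male']
print(gender_onehot)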

 

3. Dummy Variable Encoding

The problem with one hot encoding is that it creates redundancy. For example, if male is represented by [0 1], we don't need [1 0] to represent female; [0 0] is enough. Dropping the redundant column in this way is called dummy variable encoding, and it represents n categories with n-1 binary variables.

To see how dummy variable encoding works, we'll encode the categorical data of the previous dataset:

import pandas as pd

# drop_first=True drops the first category's column (CEO), making it the baseline
pos = pd.get_dummies(df['Position'], drop_first=True)
print(pos)

Output:

   Cleaner  Employee
0        0         0
1        1         0
2        0         1
3        1         0
4        0         0
5        1         0

We can see that Cleaner is represented with [1 0], Employee is represented with [0 1] and CEO is represented by [0 0].

Here there are 3 categories (n = 3) and dummy variable encoding represents them with 2 (n - 1) binary columns. Pandas provides the get_dummies() function to perform dummy variable encoding; setting the drop_first argument to True drops the first category's column, making that category the all-zeros baseline.
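
get_dummies() can also encode several columns at once and generate ready-to-use column names. A minimal sketch (the dtype=int argument simply keeps the output as 0/1 on newer pandas versions, which otherwise return True/False):

# Encode both categorical columns in one call; the result has columns such as
# Gender_male, Position_Cleaner and Position_Employee (the dropped categories
# female and CEO become the all-zeros baselines).
dummies = pd.get_dummies(df[['Gender', 'Position']], drop_first=True, dtype=int)
print(dummies)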

 

Conclusion

Machines understand numerical data, not categorical data. Training a machine learning model requires the input and output variables to be in numerical form, so handling categorical data properly is very important.

Label encoding is used with ordinal data, while one hot encoding is used with nominal data. Label encoding can introduce a false priority ordering, while one hot encoding introduces a redundant column. Dummy variable encoding removes that redundancy, which is especially important for linear regression (and other linear models).

Happy Learning:-)