Outliers are data points that lie far from the other observations in a dataset. In other words, they are unusual values that do not fit the mainstream of the data.
Impacts of outliers
In machine learning projects, it is important to remove outliers during model building because their presence can mislead the model. Outliers can shift the mean and inflate the standard deviation of the whole dataset, which can badly affect the performance of the model. They also increase the error variance and reduce the power of statistical tests.
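To make the effect on the mean and standard deviation concrete, here is a minimal sketch using Python's standard `statistics` module. The ages below are made-up values chosen for illustration; they are not taken from the dataset used later in this tutorial.

```python
import statistics

# Hypothetical ages; the value 200 is an obvious data entry error.
ages = [22, 25, 27, 29, 31, 33, 35]
ages_with_outlier = ages + [200]

mean_clean = statistics.mean(ages)
std_clean = statistics.stdev(ages)  # sample standard deviation

mean_with_outlier = statistics.mean(ages_with_outlier)
std_with_outlier = statistics.stdev(ages_with_outlier)

print("Mean without / with outlier:", mean_clean, mean_with_outlier)
print("Std  without / with outlier:", std_clean, std_with_outlier)
```

In this toy example a single extreme value is enough to pull the mean well away from the bulk of the data and to inflate the standard deviation many times over.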
Some of the reasons for the presence of outliers are as follows:
- Data entry error (human error)
- Experimental measurement error
- Measurement error (instrument error)
- Sampling error
Detection of outliers
Detecting outliers is one of the more challenging jobs in data cleaning. There is no single precise way to detect and remove outliers, because the right approach depends on the specifics of each dataset. Instead, reasonable assumptions and observations must be made to remove the values that look unusual compared with the rest of the data. The two ways of detecting outliers are:
- Visualization method
- Statistical method
1. Visualization method
In this method, visualization techniques are used to identify the outliers in a dataset. Boxplots and scatterplots are the two plots commonly used for this purpose. A boxplot is used for univariate analysis, while a scatterplot is used for multivariate analysis.
A boxplot is a graphical method of displaying numerical data based on the five-number summary, namely:
i. Minimum (0th percentile)
ii. Maximum (100th percentile)
iii. Median (50th percentile)
iv. First quartile (25th percentile)
v. Third quartile (75th percentile)
A boxplot draws a box from the first quartile to the third quartile, with lines known as whiskers extending from both ends of the box to show the variability of the data beyond the quartiles. Points that fall past the whiskers are drawn individually and are candidates for outliers.
This is a boxplot of the age of individuals, and the point lying near the 200 mark is marked as an outlier. An age of 200 lies far away from the other data and is clearly unusual. This is how a boxplot (a visualization tool) is used for the detection of outliers.
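The numbers behind a boxplot can also be computed directly. The following is a minimal sketch, assuming pandas is installed; the `ages` values are hypothetical and the names `lower_fence` and `upper_fence` are introduced here only for illustration. It uses the common convention that whiskers reach at most 1.5 * IQR beyond the quartiles.

```python
import pandas as pd

# Hypothetical ages, including one impossible value of 200.
ages = pd.Series([23, 25, 28, 31, 34, 36, 39, 42, 45, 200], name="age")

# Five-number summary: minimum, Q1, median, Q3, maximum.
summary = ages.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)

q1 = summary.loc[0.25]
q3 = summary.loc[0.75]
iqr = q3 - q1

# Points beyond these fences are drawn as individual dots on a boxplot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = ages[(ages < lower_fence) | (ages > upper_fence)]
print("Flagged as potential outliers:", flagged.tolist())
```

Only the value 200 falls outside the fences here, which matches what the boxplot shows visually.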
A scatterplot is used in multivariate analysis for the detection of outliers. Data points lying far away from the other data points can be spotted using a scatterplot.
In the above scatterplot, two points lie very far from the other data points. By visualizing the data with a scatterplot, we can detect such outliers.
2. Statistical method
Statistical measures such as the standard deviation, the interquartile range and the z-score are used for the detection and removal of outliers. In this tutorial, we'll use the interquartile range (IQR) method, the standard deviation method and the z-score method for outlier detection and removal.
Interquartile range (IQR) method
The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). In this method, anything lying above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR is considered an outlier. For demonstration purposes I'll use Jupyter Notebook and the heart disease dataset from Kaggle. Let's read the dataset and look at the first few rows.
Make sure you have installed pandas and seaborn using the commands:
    $ pip install pandas
    $ pip install seaborn

    import pandas as pd

    # Load the dataset; heart.csv is expected in the working directory
    df = pd.read_csv("heart.csv")
    df.head()
This is the dataframe, and we'll be using the 'chol' column for further analysis. First of all, let's see whether it has any outliers:
    import seaborn as sns

    # Pass the column by keyword so seaborn treats it as the variable to plot
    sns.boxplot(x=df['chol'])
We can see that there are some outliers. Now, let's see how these outliers can be detected and removed using the IQR technique. For the IQR method, let's first create a function:
    def outliers(df, feature):
        # Quartiles of the chosen column
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        # Limits 1.5 * IQR beyond the quartiles
        upper_limit = Q3 + 1.5 * IQR
        lower_limit = Q1 - 1.5 * IQR
        return upper_limit, lower_limit

    upper, lower = outliers(df, "chol")
    print("Upper whisker: ", upper)
    print("Lower Whisker: ", lower)

    Upper whisker: 369.75
    Lower Whisker: 115.75
As discussed earlier, anything lying outside the range from 115.75 to 369.75 is an outlier.
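As a sanity check, the quartiles can be recovered from the two printed limits alone. Because upper - lower = (Q3 - Q1) + 3 * IQR = 4 * IQR, the arithmetic can be worked backwards. The sketch below uses only the two numbers printed above; the variable names are chosen here for illustration.

```python
# Limits printed by the IQR function for the 'chol' column.
upper = 369.75
lower = 115.75

# upper - lower = (Q3 + 1.5 * IQR) - (Q1 - 1.5 * IQR) = 4 * IQR
iqr = (upper - lower) / 4

# Undo the 1.5 * IQR offsets to recover the quartiles.
q3 = upper - 1.5 * iqr
q1 = lower + 1.5 * iqr

print("IQR:", iqr)
print("Q1:", q1, "Q3:", q3)
```

This gives an IQR of 63.5 with Q1 = 211.0 and Q3 = 274.5, so the middle half of the cholesterol values spans 211 to 274.5 and the limits sit 95.25 beyond each quartile.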
Let's take a look at the outliers:
    df[(df['chol'] < lower) | (df['chol'] > upper)]
These are the outliers lying beyond the upper and lower limits computed with the IQR method.
To remove these outliers from the dataset:
    # Keep every row inside the limits (boundaries included, since only values strictly outside were flagged)
    new_df = df[(df['chol'] >= lower) & (df['chol'] <= upper)]
So, this new dataframe new_df contains the data that lies between the lower and upper limits computed using the IQR method.
Using this method, we found that there are five (5) outliers in the dataset. This is how outliers can be detected and removed using the IQR method.
Standard deviation method
The standard deviation measures how far the data points are spread out from the mean value. It is common practice to treat values more than 3 standard deviations from the mean as outliers. Using 3 standard deviations is not mandatory; one can use 4 or even 5 standard deviations according to the requirement.
    def outlier_removal(df, variable):
        # Limits 3 standard deviations either side of the mean
        upper_limit = df[variable].mean() + 3 * df[variable].std()
        lower_limit = df[variable].mean() - 3 * df[variable].std()
        return upper_limit, lower_limit

    upper_limit, lower_limit = outlier_removal(df, "chol")
    print("Upper limit: ", upper_limit)
    print("Lower Limit: ", lower_limit)

    Upper limit: 401.75627936643036
    Lower Limit: 90.77177343885015
Anything that does not fall between these upper and lower limits is considered an outlier.
Now, let's take a look at these outliers:
    df[(df['chol'] < lower_limit) | (df['chol'] > upper_limit)]
These are the outliers lying beyond the upper and lower limits computed using the standard deviation method. Using this method, we found that there are 4 outliers in the dataset.
To remove these outliers from our dataset:
    # Use the limits from the standard deviation method, not the IQR limits
    new_df = df[(df['chol'] >= lower_limit) & (df['chol'] <= upper_limit)]
This new dataframe contains only those data points that lie inside the upper and lower limits. So, this is how we can detect and remove the outliers from our dataset.
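Since the number of standard deviations is a choice rather than a fixed rule, it can be exposed as a parameter. The following is a minimal sketch, assuming pandas is installed; `std_limits` and the `chol` values are hypothetical and made up for illustration, not taken from heart.csv.

```python
import pandas as pd

# Hypothetical cholesterol-like values: 100 ordinary readings plus two extremes.
chol = pd.Series(list(range(200, 300)) + [420, 600], name="chol")

def std_limits(series, k=3):
    """Return (upper, lower) limits k standard deviations from the mean."""
    mean = series.mean()
    std = series.std()  # pandas computes the sample standard deviation (ddof=1)
    return mean + k * std, mean - k * std

# Count how many points are flagged for each choice of k.
outlier_counts = {}
for k in (3, 4, 5):
    upper_k, lower_k = std_limits(chol, k)
    flagged = chol[(chol < lower_k) | (chol > upper_k)]
    outlier_counts[k] = len(flagged)
    print(f"k={k}: {len(flagged)} outlier(s) flagged")
```

On this toy data, 3 standard deviations flags both extreme values, while 4 or 5 standard deviations flags only the most extreme one, so a larger multiplier makes the rule more conservative.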
Z-score method
The z-score measures how many standard deviations a data point lies away from the mean. The formula used to calculate the z-score is:
z = (x - μ) / σ
where
x = data point
μ = mean
σ = standard deviation
The z-score method is similar to the standard deviation method for outlier detection and removal. Either of the two methods (z-score or standard deviation) can be used for outlier treatment.
Let's see how the z-score is used to detect and remove outliers:
    # z-score of every cholesterol value
    df['z_score'] = (df['chol'] - df['chol'].mean()) / df['chol'].std()
    df.head()
Now, using this calculated z-score, we'll mark a data point as an outlier if its z-score is above 3 or below -3.
    df[(df['z_score'] < -3) | (df['z_score'] > 3)]
These are the outliers: the rows whose z-score is below -3 or above 3. We can see that the outliers obtained from the z-score method and the standard deviation method are exactly the same.
A data point with a z-score greater than 3 is more than 3 standard deviations away from the mean, which is the same concept applied in the standard deviation method. So, the z-score method is an alternative form of the standard deviation method of outlier detection. Using this method, we found that there are 4 outliers in the dataset.
To remove these outliers we can do:
    # Keep every row whose z-score is within 3 standard deviations of the mean
    new_df = df[(df['z_score'] >= -3) & (df['z_score'] <= 3)]
This new dataframe is free from these outliers: it contains only the rows whose z-score lies between -3 and 3.
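The equivalence of the two methods follows from rearranging the z-score formula: z > 3 is the same condition as x > μ + 3σ, and z < -3 is the same as x < μ - 3σ. The following is a minimal sketch that checks this on made-up data, assuming pandas is installed; the `chol` values are hypothetical and not taken from heart.csv.

```python
import pandas as pd

# Hypothetical cholesterol-like values: 100 ordinary readings plus two extremes.
chol = pd.Series(list(range(200, 300)) + [420, 600], name="chol")

mean = chol.mean()
std = chol.std()  # sample standard deviation, as used in the tutorial code

# Z-score method: flag |z| > 3.
z_score = (chol - mean) / std
mask_z = (z_score < -3) | (z_score > 3)

# Standard deviation method: flag values beyond mean +/- 3 * std.
mask_sd = (chol < mean - 3 * std) | (chol > mean + 3 * std)

print("Same rows flagged:", mask_z.equals(mask_sd))
print("Flagged values:", chol[mask_z].tolist())
```

Both masks select the same rows, so the choice between the two methods comes down to which form is more convenient to work with.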
Conclusion
Outlier detection and removal is an important task in the data cleaning process. These unusual values may change the standard deviation and mean of the dataset, causing poor performance of the machine learning model.
Hence, outliers should be removed from the dataset for better model performance, but this is not always an easy task.
Outliers can be detected using visualization tools such as boxplots and scatterplots. Statistical methods such as the IQR, standard deviation and z-score methods can be used for the detection and removal of outliers.
As we have seen above, the z-score method and the standard deviation method flag exactly the same points. In some cases the detection of outliers is easy, but in others it can be challenging, and one should choose the method that suits the requirement.
Happy Learning :-)