Feature selection is the process of reducing the number of input features when developing a machine learning model. It is done to reduce the computational cost of the model and to improve its performance.
Features that have a high correlation with the output variable are selected for training the model. Selecting a subset of the input features is important because it helps build an efficient model from the features that are most relevant to the target variable.
Building a model with redundant features may mislead the model and hamper its performance. Hence, feature selection is essential.
Categorization of feature selection
Feature selection is subdivided into two categories:
- Supervised techniques: used for labelled data
- Unsupervised techniques: used for unlabelled data
For demonstration, I am using Jupyter Notebook and the heart disease prediction dataset from Kaggle to implement various feature selection techniques. Here are some of the methods for feature selection:
1. Filter method
The filter method scores each feature by the strength of its statistical relationship with the target variable, for example its correlation. It is a univariate analysis, as it checks how relevant each feature is to the target variable individually. The types of filter method are as follows:
a) Information gain method
The information gain method computes the reduction in entropy. It comes from information theory and measures how much information a feature provides about another variable, here the target. Let’s see how the information gain method is used for feature selection:
First, I am going to load the dataset:
import pandas as pd

df = pd.read_csv("heart.csv")
df.head()
Now, let’s implement the information gain method as:
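from sklearn.feature_selection import mutual_info_classif

# All columns except the last are input features; the last column is the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Estimate the mutual information between each feature and the target
scores = mutual_info_classif(X, y)
print(scores)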
[0.00113135 0. 0.14604363 0. 0.09081394 0.01610141 0.03330569 0.08534967 0.10247971 0.0602119 0.11768226 0.10865301 0.16903598]
Let’s plot a bar chart for better visualization:
import matplotlib.pyplot as plt

features = df.columns[0:13]
# Pair each mutual information score with its feature name
new_df = pd.Series(scores, index = features)
new_df.plot(kind = 'barh')
plt.ylabel("Features")
plt.xlabel("scores")
plt.title("Features with scores")
plt.show()
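Looking at this bar chart, we can select as many features as required. The feature ‘trtbps’ appears to have the lowest score, and features such as ‘sex’ and ‘age’ can also be dropped from the dataset when training the model. Note that the estimated scores can change slightly between runs unless random_state is set in mutual_info_classif.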
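To make the idea of "reduction in entropy" concrete, here is a minimal hand-rolled sketch of information gain for categorical features. The toy data and the helper names (entropy, information_gain) are made up for illustration and are not part of the heart dataset or of scikit-learn; mutual_info_classif estimates the same kind of quantity but uses a different estimator, so its numbers are not directly comparable.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after grouping by feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Made-up toy data: 8 samples with a binary target
target      = [1, 1, 1, 1, 0, 0, 0, 0]
informative = ["a", "a", "a", "a", "b", "b", "b", "b"]  # separates the target perfectly
useless     = ["x", "y", "x", "y", "x", "y", "x", "y"]  # says nothing about the target

gain_informative = information_gain(informative, target)
gain_useless = information_gain(useless, target)
print(gain_informative, gain_useless)
```

A feature that separates the classes perfectly removes all of the uncertainty in the target, while a feature that is unrelated to the target removes none, which is why low-scoring features are candidates for dropping.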
b) Chi-square method
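The chi-square method is used for categorical data and calculates the chi-square statistic between each input feature and the target variable. The statistic is computed under the null hypothesis that the feature and the target are independent, so a higher value indicates a stronger dependence. The formula used for calculating chi-square is:
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
where O_i is the observed frequency and E_i is the frequency expected under the null hypothesis.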
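In its textbook form, the statistic sums (observed - expected)^2 / expected over every cell of a contingency table. The sketch below computes it by hand for a made-up 2x2 table; the function name chi_square and the counts are assumptions for illustration only. As far as I understand, scikit-learn's chi2 aggregates the non-negative feature values per class rather than building a contingency table like this, so its scores will not match this calculation.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if feature and target were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic

# Made-up counts: rows = feature value (0 / 1), columns = target (0 / 1)
dependent_table = [[30, 10],
                   [10, 30]]
independent_table = [[20, 20],
                     [20, 20]]

stat_dependent = chi_square(dependent_table)
stat_independent = chi_square(independent_table)
print(stat_dependent, stat_independent)
```

The table in which the feature value and the target move together gets a large statistic, while the evenly spread table gets zero.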
Now, let’s implement the chi-square method:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

feature = SelectKBest(score_func = chi2, k = 'all')
best_features = feature.fit(X, y)
print(best_features.scores_)
[ 23.28662399, 7.57683451, 62.59809791, 14.8239245 , 23.93639448, 0.20293368, 2.97827075, 188.32047169, 38.91437697, 72.64425301, 9.8040952 , 66.44076512, 5.79185297]
Using these scores and features, let’s plot the bar chart for better understanding:
import matplotlib.pyplot as plt

features = df.columns[0:13]
new_df = pd.Series(best_features.scores_, index = features)
new_df.plot(kind = 'barh')
plt.ylabel("Features")
plt.xlabel("scores")
plt.title("Features with scores")
plt.show()
Looking at this bar chart, we can select the top 10 or top 8 features. You can also set k = 10 (say) instead of k = ‘all’ to select the top 10 features from the dataset. The feature ‘thalachh’ has the highest score and the feature ‘fbs’ has the lowest.
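If you prefer a ranked list to reading the chart, the printed scores can be sorted directly. This is only a sketch: the scores are copied from the output above, while the column order in feature_names is my assumption about how heart.csv is laid out, so check it against df.columns before relying on it.

```python
# ASSUMPTION: this is the column order of heart.csv (excluding the target column)
feature_names = ["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
                 "thalachh", "exng", "oldpeak", "slp", "caa", "thall"]

# Chi-square scores copied from the printed output above
chi2_scores = [23.28662399, 7.57683451, 62.59809791, 14.8239245, 23.93639448,
               0.20293368, 2.97827075, 188.32047169, 38.91437697, 72.64425301,
               9.8040952, 66.44076512, 5.79185297]

# Sort (feature, score) pairs from highest to lowest score
ranked = sorted(zip(feature_names, chi2_scores),
                key=lambda pair: pair[1], reverse=True)

k = 8
top_k = [name for name, _ in ranked[:k]]
print(top_k)
```

Changing k here plays the same role as changing k in SelectKBest.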
c) Correlation coefficient method
In this method, the correlation coefficient of each input feature with the target variable is calculated. Correlation can be positive or negative.
A positive correlation coefficient means that an increase (or decrease) in the feature comes with a corresponding increase (or decrease) in the output variable. A negative correlation means that an increase in the feature comes with a decrease in the target variable, and vice versa.
The correlation coefficient (r) ranges from -1 to 1.
If r = 1, perfect positive correlation,
If r = 0, no linear correlation,
If r = -1, perfect negative correlation
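The three cases above can be checked with a small hand-written Pearson correlation, which is the default measure that pandas' corr() computes. The function name pearson_r and the toy numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # y rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson_r(x, [1, 3, 2, 3, 1])        # no linear trend

print(r_positive, r_negative, r_none)
```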
Now, let’s see how the correlation coefficient method is used for feature selection:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (13, 10))
sns.heatmap(df.corr(), annot = True)
plt.show()
We know that the correlation coefficient of a variable with itself is 1. Looking at the correlation matrix above, the features ‘cp’, ‘thalachh’ and ‘slp’ are highly positively correlated with the output variable, and the features ‘thall’, ‘caa’, ‘oldpeak’, ‘exng’, ‘age’ and ‘sex’ are negatively correlated with it. The remaining features don’t have much correlation with the output variable, so we can drop them from the dataset.
2. Wrapper method
The wrapper method doesn’t use a statistical test for feature selection. It takes a subset of features, trains the model on them and calculates the accuracy. It repeats this process until it arrives at the subset of features that gives the best accuracy. Since it involves training the model several times, it is expensive and time-consuming, and it is only suitable for small datasets.
a) Recursive Feature Elimination
Recursive Feature Elimination (RFE) recursively removes the least important features until the desired number of features is reached, with the aim of improving the performance and accuracy of the model.
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Keep the 8 features that a linear SVM ranks highest
rfe = RFE(SVC(kernel = 'linear'), n_features_to_select = 8)
rfe.fit(X_train, y_train)

# Predict with the model trained on the selected features only
pred = rfe.predict(X_test)
print("Accuracy : ", accuracy_score(y_test, pred))
Accuracy : 0.8524590163934426
This is how RFE is implemented to select the features and obtain the accuracy of the model.
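Under the hood, the elimination loop looks roughly like the sketch below: fit, drop the feature with the smallest absolute weight, and refit on what is left. This is a simplified illustration and not scikit-learn's actual code; fit_weights is a hypothetical callback standing in for model training, and the feature names and weights are made up.

```python
def recursive_feature_elimination(features, fit_weights, n_to_keep):
    """Drop the weakest feature, refit, and repeat until n_to_keep features remain.

    fit_weights is a hypothetical callback: it takes a list of feature names,
    "trains" a model on them and returns a dict {feature: weight}.
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_to_keep:
        weights = fit_weights(remaining)          # refit on the current subset
        weakest = min(remaining, key=lambda f: abs(weights[f]))
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated

# Toy stand-in for model training: fixed, made-up weights
TOY_WEIGHTS = {"f1": 0.9, "f2": -0.05, "f3": 0.4, "f4": -0.7, "f5": 0.01}

def toy_fit(subset):
    return {f: TOY_WEIGHTS[f] for f in subset}

kept_features, dropped_features = recursive_feature_elimination(
    list(TOY_WEIGHTS), toy_fit, n_to_keep=3)
print(kept_features, dropped_features)
```

With a real model the weights change after every refit, which is why the ranking is recomputed in each round instead of being sorted once.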
b) Forward selection method
The forward selection method is an iterative process that starts with no features in the model. In each iteration, it adds the feature that is most relevant to the target variable. It continues until adding new features no longer improves the model performance. We are going to use the same dataset as in the feature selection methods above. For this we need the mlxtend module, so install it first:
$ pip install mlxtend
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

ffs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors = 4),
                                k_features = 10,
                                forward = True,
                                n_jobs = -1)
fs = ffs.fit(X, y)
print(fs.k_feature_names_)
print(fs.k_score_)
('age', 'sex', 'cp', 'fbs', 'restecg', 'exng', 'oldpeak', 'slp', 'caa', 'thall')
0.7625136612021859
These are the 10 features that forward selection found most relevant to the output variable. We can select any number of features by specifying the value of k_features.
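The greedy loop behind forward selection can be sketched in a few lines. This is an illustration under stated assumptions, not mlxtend's implementation: score is a hypothetical callback that would normally train the model on a feature subset and return something like a cross-validated accuracy, and the toy scoring function and feature names below are made up.

```python
def forward_selection(candidates, score, k):
    """Greedily add the feature that gives the best score until k are selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Try each remaining feature together with the ones already selected
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up individual usefulness of four features
VALUE = {"a": 0.30, "b": 0.28, "c": 0.10, "d": 0.05}

def toy_score(subset):
    """Toy stand-in for model accuracy on a feature subset."""
    total = sum(VALUE[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 0.25  # 'a' and 'b' carry nearly the same information
    return total

chosen = forward_selection(list(VALUE), toy_score, k=2)
print(chosen)
```

Because each candidate is scored together with the features already chosen, a feature that looks strong on its own can lose to a weaker one that adds new information. A univariate filter method cannot see that.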
c) Backward elimination method
The backward elimination method is the reverse of the forward selection method. Initially it trains the model with all the features, and iteration by iteration it removes features, keeping the ones that serve the model best and hence aiming to increase the accuracy of the model.
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

ffs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors = 4),
                                k_features = 8,
                                forward = False,
                                n_jobs = -1)
fs = ffs.fit(X, y)
print(fs.k_feature_names_)
print(fs.k_score_)
('sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall')
0.8513114754098361
These are the top 8 features that are relevant to the target variable.
3. Embedded method
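The embedded method performs feature selection while creating the machine learning model.
In this method, regularization shrinks some of the coefficients towards zero. With an L1 penalty some coefficients become exactly zero, meaning those features are multiplied by zero when estimating the target, so they can be removed because they do not contribute to the model. With an L2 penalty, as in the code below, coefficients rarely reach exactly zero, and SelectFromModel instead keeps the features whose absolute coefficient is at least the mean absolute coefficient.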
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# X_train and y_train come from the train/test split in the RFE example
sfm = SelectFromModel(LogisticRegression(C = 1, penalty = 'l2'))
sfm.fit(X_train, y_train)

# get_support() returns a boolean mask of the selected features
important_features = X_train.columns[(sfm.get_support())]
print(important_features)
Index(['sex', 'cp', 'restecg', 'exng', 'oldpeak', 'caa', 'thall'], dtype='object')
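As far as I understand from the scikit-learn documentation, when the estimator does not use an L1 penalty, SelectFromModel falls back to a "mean" threshold: a feature is kept if its absolute coefficient is at least the mean of all absolute coefficients. The sketch below mimics that rule by hand; the function name select_by_mean_threshold and the coefficients are made up and are not the ones fitted above.

```python
def select_by_mean_threshold(coefficients):
    """Keep features whose absolute coefficient is at least the mean absolute coefficient."""
    importances = {name: abs(c) for name, c in coefficients.items()}
    threshold = sum(importances.values()) / len(importances)
    selected = [name for name, imp in importances.items() if imp >= threshold]
    return selected, threshold

# Made-up coefficients of a fitted linear model
toy_coefficients = {"f1": 1.2, "f2": -0.1, "f3": -0.9, "f4": 0.2}

selected_features, mean_threshold = select_by_mean_threshold(toy_coefficients)
print(selected_features, mean_threshold)
```

Note that the sign of a coefficient does not matter here: a strongly negative coefficient counts as just as important as a strongly positive one.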
Feature selection is important for filtering redundant features out of the dataset. The presence of redundant features can mislead the model and degrade its performance.
The filter method of feature selection uses a statistical approach to select the features, while the wrapper method doesn’t. The wrapper method is only suitable for small datasets and can be computationally very expensive with large ones.
The embedded method selects the features while the model is being built, hence the name. In short, feature selection matters because not all features are relevant to the output variable, and selecting only a subset of the available features can improve the performance of the model.
Happy Learning 🙂