# Regularization Concept In Machine Learning

## Introduction

Regularization is a very important concept in machine learning. Overfitting is a major problem in the field: an overfitted model has a high error on test data or new data that it hasn’t seen before. We must make sure that the model does not overfit while we are building it.

We can overcome the problem of overfitting with the following techniques:

• Regularization
• Feature selection
• Cross validation techniques
• Ensemble techniques

In this tutorial we are going to discuss regularization techniques and how we can use them to overcome the problem of overfitting.

Regularization is a technique that shrinks the coefficient estimates towards zero. It adds a penalty for more complex models and so discourages the learning of more complex models, which reduces the chance of overfitting.

Now, let’s consider a simple linear regression that looks like

$$Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

Here, $b_0$ represents the intercept and $b_1, b_2, \dots, b_n$ represent the slopes.

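The loss function that is associated with linear regression is given below

$$\text{RSS} = \sum_{i=1}^{m} \left(Y_i - \hat{Y}_i\right)^2$$

Here $m$ is the number of training examples, $Y_i$ is the true value and $\hat{Y}_i$ is the predicted value, given by

$$\hat{Y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_n x_{in}$$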

The main objective of a linear regression model is to minimize the cost function, i.e. the Residual Sum of Squares (RSS). Regularization adds a penalty to this RSS to reduce overfitting and create a generalized model.
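The RSS described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `predict` and `rss`, and the three data points, are assumptions made for this example and are not taken from the tutorial.

```python
def predict(intercept, slopes, features):
    """Linear prediction: b0 + b1*x1 + ... + bn*xn."""
    return intercept + sum(b * x for b, x in zip(slopes, features))


def rss(intercept, slopes, X, y):
    """Residual sum of squares over all rows of X."""
    return sum((yi - predict(intercept, slopes, xi)) ** 2 for xi, yi in zip(X, y))


# Hypothetical data: hours studied per day -> marks obtained.
X = [[1.0], [2.0], [3.0]]
y = [40.0, 50.0, 60.0]

print(rss(30.0, [10.0], X, y))  # this line passes through every point
print(rss(30.0, [8.0], X, y))   # a flatter line leaves some residual error
```

With this made-up data the first line fits every point exactly, so its RSS is zero, while the flatter line has a positive RSS.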

### Types of regularization

Generally there are two types of regularization: ridge (L2) regularization and lasso (L1) regularization.

### Ridge (L2) regularization

In ridge regularization the cost function is altered by adding a penalty term equal to the sum of the squares of the coefficient estimates, multiplied by a penalty factor. After adding the penalty term the cost function becomes

$$\text{Cost}_{\text{ridge}} = \text{RSS} + \lambda \sum_{j=1}^{n} b_j^2$$

Here lambda (λ) is the penalty factor. The value of lambda (λ) determines the size of the penalty added to the cost function, and therefore how strongly the coefficients of the independent variables are shrunk.

• If lambda (λ) = 0, no coefficients are shrunk and we get ordinary linear regression

• As lambda (λ) approaches infinity, all coefficients are shrunk towards zero
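The effect of lambda (λ) can be checked on a tiny example. The sketch below assumes a single feature and an unpenalized intercept, in which case the ridge slope has the closed form Sxy / (Sxx + λ). The function name `ridge_slope` and the data are hypothetical and chosen only for illustration.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge slope for one feature with an unpenalized intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # lam = 0 gives the ordinary least-squares slope; larger lam shrinks it.
    return sxy / (sxx + lam)


# Hypothetical data: hours studied per day -> marks obtained.
hours = [1.0, 2.0, 3.0]
marks = [40.0, 50.0, 60.0]

for lam in [0.0, 1.0, 10.0, 1000.0]:
    print(lam, ridge_slope(hours, marks, lam))
```

The printed slope gets smaller as lambda (λ) grows, but it never becomes exactly zero for any finite value.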

As the value of lambda (λ) increases, the coefficients are shrunk more strongly. Let’s take a simple example to understand how ridge regression works.

Suppose we want to predict the marks obtained by students from the number of hours per day they studied, and for simplicity we have only three data points. If the best fit line passes through all the training points, it is an example of overfitting. Let’s see how ridge regression is used to overcome the problem of overfitting.

### Working procedure of ridge regularization

Let’s take λ = 1 (it can be any positive number) and suppose that the three slopes b1, b2, b3 are equal to 1.2, 1.3, 1.4 respectively.

From the RSS equation above, the value of RSS is zero for simple linear regression because all the true and predicted values overlap. Since RSS cannot go any lower, the linear regression algorithm stops there and considers that it has found the best fit line. But we know that RSS is zero only because of overfitting, so this line is not the true best fit line.

To overcome this effect we need to look for another best fit line that gives low bias and low variance. This is where regularization comes into the picture.

Using the modified cost function for ridge regression and the slope values we assumed, the cost is 0 + 1×(1.2^2) + 1×(1.3^2) + 1×(1.4^2) = 5.09, where the leading 0 is the RSS term. The cost after adding the penalty is no longer zero (the minimum), so the algorithm looks for another best fit line.

For the updated best fit line, the algorithm calculates the new cost. Suppose the slope values are now 0.8, 1.05, 1.08. The new cost is 0 + 1×(0.8^2) + 1×(1.05^2) + 1×(1.08^2) = 2.9089. The algorithm keeps looking for a lower cost.

For the next best fit line, suppose the slope values are 0.5, 0.85, 0.95. The new cost is 0 + 1×(0.5^2) + 1×(0.85^2) + 1×(0.95^2) = 1.875. This process continues until the cost is minimal, and the corresponding line is the best fit line.

Hence, the algorithm gets a best fit line that produces low bias and low variance and removes the problem of overfitting. This is the working mechanism of ridge regression.
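The penalty arithmetic in this walkthrough can be recomputed with a short script. The helper name `ridge_cost` is an assumption for this sketch, and the RSS term is held at 0 exactly as the walkthrough does for simplicity.

```python
def ridge_cost(rss, slopes, lam):
    """Ridge cost: RSS plus lam times the sum of squared slopes."""
    return rss + lam * sum(b ** 2 for b in slopes)


# Candidate slope sets (b1, b2, b3) from the walkthrough, with lam = 1.
candidates = [
    [1.2, 1.3, 1.4],
    [0.8, 1.05, 1.08],
    [0.5, 0.85, 0.95],
]

for slopes in candidates:
    # RSS is kept at 0 here, following the simplification in the text.
    print(slopes, round(ridge_cost(0.0, slopes, 1.0), 4))
```

Each successive set of slopes is smaller in magnitude, so the printed cost falls from one candidate to the next.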

### Lasso (L1) regularization

In lasso regularization the cost function is altered by adding a penalty term equal to the sum of the absolute values of the coefficient estimates, multiplied by a penalty factor. After adding the penalty term the cost function becomes

$$\text{Cost}_{\text{lasso}} = \text{RSS} + \lambda \sum_{j=1}^{n} |b_j|$$

Just like in ridge regularization, lambda (λ) is the penalty factor. The value of lambda (λ) determines the size of the penalty added to the cost function, and therefore how strongly the coefficients of the independent variables are shrunk.

• If lambda (λ) = 0, no coefficients are shrunk and we get ordinary linear regression

• As lambda (λ) approaches infinity, all coefficients are shrunk to zero
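For lasso, the two extremes of lambda (λ) can also be seen on a tiny example. The sketch below assumes a single feature, an unpenalized intercept, and the cost RSS + λ|b1|. Under those assumptions the slope is obtained by reducing |Sxy| by λ/2, clipping at zero, and dividing by Sxx. The function name `lasso_slope` and the data are hypothetical.

```python
def lasso_slope(x, y, lam):
    """Closed-form lasso slope for one feature with an unpenalized intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    threshold = lam / 2.0
    if sxy > threshold:
        return (sxy - threshold) / sxx
    if sxy < -threshold:
        return (sxy + threshold) / sxx
    return 0.0  # the penalty outweighs the fit: slope is exactly zero


# Hypothetical data: hours studied per day -> marks obtained.
hours = [1.0, 2.0, 3.0]
marks = [40.0, 50.0, 60.0]

for lam in [0.0, 10.0, 40.0, 100.0]:
    print(lam, lasso_slope(hours, marks, lam))
```

Unlike the ridge case, the slope here reaches exactly zero once lambda (λ) is large enough, rather than only approaching it.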

The working mechanism is the same as that of ridge regression. Let’s take the same problem as above: predicting the marks of students from the number of hours they studied.

### Working mechanism of Lasso regularization

Let’s take λ = 1 (it can be any positive number) and suppose that the three slopes b1, b2, b3 are equal to 1.2, 1.3, 1.4 respectively.

For simple linear regression the value of RSS is zero because there is no error between the true values and the predicted values. To overcome the problem of overfitting using lasso regression, the algorithm calculates the cost using the lasso formula mentioned above. Using the assumed slope values, the cost is 0 + 1×1.2 + 1×1.3 + 1×1.4 = 3.9, where the leading 0 is the RSS term.

Since this cost is not minimal, the algorithm finds the next line, just as in ridge regression, and calculates the cost again. Suppose the slope values are now 0.8, 1.05, 1.08. The new cost is 0 + 1×0.8 + 1×1.05 + 1×1.08 = 2.93.

The algorithm keeps looking for the best fit line for which the cost is minimal. Suppose the next slope values are 0.5, 0.85, 0.92. The new cost based on this line is 0 + 1×0.5 + 1×0.85 + 1×0.92 = 2.27.

This process continues until the cost is minimal and the algorithm finds the best fit line for the model. Hence, we get a best fit line with low bias and low variance, which removes the problem of overfitting. This is the working mechanism of lasso regression.
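The lasso arithmetic can be recomputed in the same way. In this sketch the helper name `lasso_cost` is an assumption, the RSS term is held at 0 as in the walkthrough, and the candidate slopes are the values that appear in the sums above.

```python
def lasso_cost(rss, slopes, lam):
    """Lasso cost: RSS plus lam times the sum of absolute slopes."""
    return rss + lam * sum(abs(b) for b in slopes)


# Candidate slope sets (b1, b2, b3) with lam = 1.
candidates = [
    [1.2, 1.3, 1.4],
    [0.8, 1.05, 1.08],
    [0.5, 0.85, 0.92],
]

for slopes in candidates:
    # RSS is kept at 0 here, following the simplification in the text.
    print(slopes, round(lasso_cost(0.0, slopes, 1.0), 4))
```

Because the penalty uses absolute values rather than squares, the cost falls in proportion to the slopes themselves as they shrink.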

### Difference between ridge and lasso regularization

The basic difference between lasso and ridge regularization is that ridge shrinks the coefficients towards zero but never makes them exactly zero, whereas lasso can shrink some coefficients exactly to zero. This is why lasso regularization is also used for feature selection.
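This difference can be made concrete under one simplifying assumption: that the feature columns are orthonormal. In that special case, with the cost functions RSS + λΣb² and RSS + λΣ|b|, ridge divides each least-squares coefficient by (1 + λ), while lasso moves each one towards zero by λ/2 and clips it at zero. The least-squares coefficients below are hypothetical values chosen for illustration.

```python
def soft_threshold(b, t):
    """Move b towards zero by t, returning exactly 0.0 if |b| <= t."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


# Hypothetical least-squares coefficients (assumes orthonormal features).
ols = [3.0, 0.4, -0.2]
lam = 1.0

ridge = [b / (1.0 + lam) for b in ols]            # every coefficient stays non-zero
lasso = [soft_threshold(b, lam / 2.0) for b in ols]  # small ones become exactly zero

print("ridge:", ridge)
print("lasso:", lasso)
```

Ridge keeps all three features with smaller coefficients, while lasso drops the two weak ones entirely, which is the feature selection behaviour described above.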

## Conclusion

Regularization is a technique that shrinks the coefficients of some features to avoid building an overly complex model. It is essential for overcoming the overfitting problem. Ridge (L2) regularization only shrinks the magnitude of the coefficients, but lasso (L1) regularization performs feature selection too. We need to build a generalized model with low bias and low variance, and when a model overfits, regularization is an essential tool for building one.

Happy Learning 🙂