Introduction

Linear regression is a machine learning model based on supervised learning that performs a regression task. It models the linear relationship between a dependent variable and one or more independent variables, hence the name linear regression. Regression predicts a target variable from the independent variables, and is used both to describe relationships between variables and for forecasting. Depending on the number of independent variables, linear regression comes in two types:

  1. Simple Linear Regression
  2. Multiple Linear Regression

 

Simple Linear Regression

In simple linear regression, the independent variable is only one. The formula used in simple linear regression to find the relationship between dependent and independent variables is:

 y = θ0 + θ1*x 

y = Dependent variable (output variable)

x = Independent variable

θ0 = Intercept

θ1 = Slope

If we plot the dependent variable (y) against the independent variable (x), we get a graph like the one shown below:

[Figure: simple linear regression example, hours vs. percentage]

 

The simple regression model tries to find the 'best fit line' (the blue line in the figure above) by adjusting the slope (θ1) and the intercept (θ0). The 'best fit line' is the line drawn such that the sum of the squared differences between the predicted values and the true values is minimal. In other words, the sum of the squared vertical distances from the points to the line is minimal. Once the best θ0 and θ1 are available, the model is ready to predict the output for a given input. To visualize the best fit line, take a look at the picture below:

[Figure: best fit line for simple linear regression]
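To make the fitting step concrete, here is a minimal NumPy sketch (the data points are made up purely for illustration) that computes the least-squares intercept (θ0) and slope (θ1) in closed form:

import numpy as np

# made-up sample data, for illustration only
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# closed-form least squares: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()

print("intercept:", theta0, "slope:", theta1)
print("prediction at x = 6:", theta0 + theta1 * 6)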

 

Multiple Linear Regression

In practice, there is usually more than one independent variable. When the output variable depends on more than one independent variable, the model is called multiple linear regression. It likewise develops a linear relationship between the dependent and independent variables. The formula used to relate the dependent and independent variables is:

 y = θ0 + θ1*x1 + θ2*x2 + . . . + θn*xn 

y = Dependent variable

xi = Independent variables, i = 1, 2, . . . , n

θ0 = Intercept

θi = Slope coefficient of the i-th independent variable

n = Number of independent variables

The best fit line is determined by tuning the values of θ0 and θi such that the sum of the squared differences between the predicted and real values is minimal.
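As a sketch of how that works for several variables (the numbers here are made up for illustration), NumPy's least-squares solver can fit all the coefficients at once:

import numpy as np

# two independent variables; values constructed so that y = 2 + 3*x1 + 1*x2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([7.0, 9.0, 15.0, 17.0, 22.0])

# prepend a column of ones so the intercept θ0 is fitted as well
X1 = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq minimizes the sum of squared residuals
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("θ0 (intercept):", theta[0])              # ≈ 2.0
print("θ1, θ2 (slopes):", theta[1], theta[2])   # ≈ 3.0, 1.0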

 

Cost Function

After we've trained our learning algorithm and obtained a hypothesis, we need to examine how good our results are. This is done by the so-called cost function.

The cost function measures the accuracy of the hypothesis outputs. It does this by comparing the predicted values of the hypothesis with the actual true values.

By achieving the best fit regression line, the model aims to predict y such that the difference between the predicted and real values is minimal. So, it is essential to update the values of θ0 and θi (in the case of multiple regression) or θ0 and θ1 (in the case of simple linear regression) to reach the values that minimize the error between the predicted and true values.

The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted values of y and the true values of y:

 J = √( (1/n) * Σ (y_pred - y_true)² ) 
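In code, that cost is a one-liner; here is a small sketch (the helper name and example values are just for illustration):

import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Squared Error between true and predicted values
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

print(rmse([1000, 3000, 5000], [1100, 2800, 5200]))  # made-up values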

 

Gradient Descent

To update the θ0 and θ1 values so as to reduce the cost function (minimizing the RMSE) and achieve the best fit line, the model uses gradient descent. The idea is to start with random θ0 and θ1 values and then iteratively update them, reaching the minimum cost.
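Here is a minimal sketch of that loop for simple linear regression; the learning rate and iteration count are picked arbitrarily for illustration, and the gradients used are those of the mean squared error, which has the same minimizer as the RMSE:

import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=1000):
    theta0, theta1 = 0.0, 0.0              # arbitrary starting values
    n = len(x)
    for _ in range(epochs):
        error = (theta0 + theta1 * x) - y  # predicted minus true values
        # partial derivatives of the mean squared error
        grad0 = (2 / n) * np.sum(error)
        grad1 = (2 / n) * np.sum(error * x)
        theta0 -= lr * grad0               # step opposite to the gradient
        theta1 -= lr * grad1
    return theta0, theta1

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 1 + 2 * x                              # data that follows y = 1 + 2x exactly
print(gradient_descent(x, y))              # approaches (1.0, 2.0)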

We'll take a small example to see linear regression at work. For this we'll create a dummy dataset having 'age' and 'no of hours' as input variables and 'salary' as the output variable. For the demonstration, I'll be using Jupyter Notebook.

First, we'll create the dummy dataset:

import pandas as pd

# dummy dataset: hours worked and age as inputs, salary as the output
info = {
    'no of hours' : [1, 2, 5, 7, 8, 10, 12, 15, 17],
    'age' : [20, 34, 21, 27, 34, 21, 20, 45, 31],
    'salary' : [1000, 3000, 5000, 8000, 8500, 9000, 12000, 15000, 22000]
}

df = pd.DataFrame(info)
print(df)

Output:

   no of hours  age  salary
0            1   20    1000
1            2   34    3000
2            5   21    5000
3            7   27    8000
4            8   34    8500
5           10   21    9000
6           12   20   12000
7           15   45   15000
8           17   31   22000

 

Let's visualize the dataset. First we'll import matplotlib and seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="age", y="salary", data=df)
plt.xlabel("age")
plt.ylabel("salary")
plt.title("age vs salary")
plt.show()

Output:

[Scatter plot: age vs salary]

 

We'll also visualize the no of hours vs salary graph:

sns.scatterplot(x="no of hours", y="salary", data=df)
plt.xlabel("no of hours")
plt.ylabel("salary")
plt.title("no of hours vs salary")
plt.show()

Output:

[Scatter plot: no of hours vs salary]

 

We'll also take a look at the age vs no of hours graph:

sns.scatterplot(x="age", y="no of hours", data=df)
plt.xlabel("age")
plt.ylabel("no of hours")
plt.title("age vs no of hours")
plt.show()

Output:

[Scatter plot: age vs no of hours]

Now, we will use a linear regression model to predict the salary based on no of hours and age. The equation used will be of the form:

 salary = θ0 + θ1 * (no of hours) + θ2 * age 

θ0 = Intercept

θ1 = Coefficient of no of hours

θ2 = Coefficient of age

Now, we will start building the model. Let's select the features and the target variable:

X = df.iloc[:, :2]   # features: 'no of hours' and 'age'
y = df.iloc[:, -1]   # target: 'salary'

 

Then, we'll import the necessary libraries:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Now, we split the dataset into training and testing sets:

# note: without a fixed random_state, the split (and the exact numbers below) varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Build the model as:

lr = LinearRegression()
model = lr.fit(X_train, y_train)   # fit on the training split
pred = model.predict(X_test)       # predict salaries for the test split
print(pred)

Output:

[ 6454.68201512 12813.11470225 24376.50611935]

 

Now, let's see the values of θ0, θ1 and θ2:

print("Intercept :",model.intercept_)

Output:

Intercept : -8477.293570728314

Here we can see that the value of the intercept (θ0) is -8477.293570728314.

print("Slope :", model.coef_)

Output:

Slope : [1059.73878119  376.83817716]

As a result:

θ1 -> coefficient of no of hours = 1059.73878119

θ2 -> coefficient of age = 376.83817716
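As a quick sanity check, we can recompute the model's prediction for one test row by hand from the intercept and coefficients (the exact numbers will differ from run to run, since the train/test split above is random):

row = X_test.iloc[0]
manual = model.intercept_ + model.coef_[0] * row["no of hours"] + model.coef_[1] * row["age"]
print(manual, pred[0])  # the two values should match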

 

Conclusion

The linear regression algorithm is a machine learning algorithm used to perform regression analysis. The model develops a linear relationship between the dependent and independent variables by minimizing the Root Mean Squared Error (RMSE) between the predicted and true values. Price prediction is one example of a linear regression application. Linear regression is a simple yet very useful machine learning algorithm.

If you want to learn more about the types of machine learning algorithms, then check the link here.

Happy Learning ;-)