In this blog, we are going to discuss some important topics in machine learning: bias, variance, underfitting, overfitting, and the performance metrics used in classification and regression analysis.
Bias is the difference between the actual value and the value that a model predicts. In machine learning, data is fed to a model, which finds patterns in the data and learns from them.
When creating a machine learning model, we want it to generalize so that it performs well on new data. If the difference between the actual values and the predicted values is large, the model is said to have high bias. High bias leads to underfitting.
A machine learning model needs a sufficient amount of data to learn the patterns in it and later make predictions on test data. A dataset may also contain noise or irrelevant data that can mislead the model.
If the model is sensitive to small fluctuations in the training dataset, its predicted values will be widely scattered. This is called variance. High variance causes large errors on test data and results in overfitting.
Let’s visually understand the terms low bias, high bias, low variance and high variance.
If a model has low bias, the predicted values will be close to the actual value, and if it has low variance, the predicted values will be less scattered. As shown in the first section of the figure, the predicted values are less scattered and close to the actual value. The smaller circle represents the actual value.
In the second section of the figure, the model has low bias and high variance. So, the predicted values are close to the actual value on average but highly scattered.
In the third section of the figure, the model has high bias and low variance. So, the predicted values are far from the actual value but less scattered.
In the fourth section of the figure, the model has high bias and high variance. So, the predicted values are far from the actual value and highly scattered.
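To put numbers on the four cases, here is a minimal sketch. It assumes we have several predictions for one actual value, measures bias as the distance of the average prediction from the actual value, and measures variance as the scatter of the predictions around their own average. The function name `bias_and_variance` and all the numbers are made up for illustration.

```python
from statistics import mean, pvariance

def bias_and_variance(predictions, actual):
    """Return (bias, variance) for repeated predictions of one actual value."""
    bias = mean(predictions) - actual   # distance of the average prediction from the target
    variance = pvariance(predictions)   # scatter of the predictions around their own average
    return bias, variance

actual = 10.0  # hypothetical actual value

# Hypothetical predictions, one list per section of the figure
cases = {
    "low bias, low variance": [9.9, 10.1, 10.0, 10.0],
    "low bias, high variance": [6.0, 14.0, 8.0, 12.0],
    "high bias, low variance": [14.9, 15.1, 15.0, 15.0],
    "high bias, high variance": [11.0, 19.0, 13.0, 17.0],
}

for name, predictions in cases.items():
    bias, variance = bias_and_variance(predictions, actual)
    print(f"{name}: bias={bias:.2f}, variance={variance:.2f}")
```

In the high variance cases the individual predictions are far apart from each other, even when their average lands on the actual value.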
If the dataset is not large enough, or the model is too simple for the problem, the machine learning model cannot find the pattern in the dataset. The algorithm is unable to learn from the data that is fed to it, so there will be high error on the training data as well as on the test data.
An underfitted model gives high error on the training data as well as on the test data.
When we provide the dataset, the algorithm looks at it a number of times to find the patterns in the data. Because of this, machine learning models can learn from the noise too. This makes the model overly complex: it fits the training data almost exactly but gives high error on new data.
This is what an overfitted model looks like. The model is complex and gives high accuracy on the training data but large error on test data or any new data.
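The difference between underfitting and overfitting can be sketched with three toy models on a tiny hand-made dataset. Everything here is an assumption for illustration: the data follows y = 2x with noise of +1 and -1 added to the training values, the "underfit" model always predicts the training mean, the "overfit" model simply memorizes the training points (nearest-point lookup), and the "balanced" model is a straight line fitted by least squares.

```python
def mse(model, xs, ys):
    """Average squared error of a model (a function of x) on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: y = 2x, with noise +1, -1, +1, ... on the training values
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def underfit(x):
    # Too simple: ignores x and always predicts the training mean
    return mean_y

def overfit(x):
    # Memorizes the training data, noise included: returns the y of the nearest training x
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Straight line fitted by ordinary least squares
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def balanced(x):
    return slope * x + intercept

for name, model in [("underfit", underfit), ("overfit", overfit), ("balanced", balanced)]:
    print(f"{name}: train error={mse(model, train_x, train_y):.3f}, "
          f"test error={mse(model, test_x, test_y):.3f}")
```

The underfit model has high error on both sets. The memorizing model has zero training error but a clearly larger test error than the fitted line, which is the pattern described above.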
A confusion matrix is a tool for examining model performance in a classification problem. The terms used in a confusion matrix are:
True Positive (TP): Correct prediction of the event case. For example, predicting cancer when the person actually has cancer.
False Positive (FP): Wrong prediction of the event case. For example, predicting cancer when the person does not have cancer.
False Negative (FN): Wrong prediction of the non-event case. For example, predicting no cancer when the person actually has cancer.
True Negative (TN): Correct prediction of the non-event case. For example, predicting no cancer when the person does not have cancer.
With the confusion matrix, we calculate the accuracy of our model as
Accuracy = (TP + TN) / (TP + FP + FN + TN)
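As a minimal sketch, the four counts and the accuracy can be computed from two lists of labels. The labels below are made up (1 means cancer, 0 means no cancer), and `confusion_counts` is just an illustrative helper name.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted event, actually event
        elif p == positive:
            fp += 1   # predicted event, actually non-event
        elif a == positive:
            fn += 1   # predicted non-event, actually event
        else:
            tn += 1   # predicted non-event, actually non-event
    return tp, fp, fn, tn

# Hypothetical labels: 1 = has cancer, 0 = does not have cancer
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}, accuracy={accuracy}")
```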
Precision is the ratio of True Positives (TP) to the total predicted positives (TP + FP). It tells us what fraction of the cases that the model predicted as positive are actually positive. The formula for calculating precision is
Precision = TP / (TP + FP)
A precision of 60% means that out of 100 cases the model predicted as positive, 60 are True Positives (TP).
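The 60% example can be checked with a small sketch. The function name `precision` and the choice to return 0.0 when the model makes no positive predictions are assumptions made here for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumed convention: no positive predictions at all
    return tp / predicted_positive

# 100 positive predictions, of which 60 are correct
print(precision(tp=60, fp=40))
```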
Recall measures how many of the actual positive cases our model correctly identifies. It is the ratio of True Positives (TP) to the total actual positives (TP + FN). The formula for recall is
Recall = TP / (TP + FN)
A recall of 80% means that our model has correctly identified 80 True Positives (TP) out of 100 actual positive cases.
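The 80% example can be checked the same way. As before, the function name `recall` and returning 0.0 when there are no actual positives are assumptions for this sketch.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model correctly identifies."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumed convention: no actual positives in the data
    return tp / actual_positive

# 100 actual positive cases, of which 80 are found by the model
print(recall(tp=80, fn=20))
```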
We need to decide which one (precision or recall) to prioritize when evaluating a model. Sometimes precision matters more than recall, and sometimes recall matters more. The F1-score combines both precision and recall by taking their harmonic mean. The formula for the F1-score is
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
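Here is a minimal sketch of the harmonic mean of precision and recall. The function name `f1_score` and the zero-division convention are assumptions for illustration. The second call shows why the harmonic mean is useful: one very low value pulls the score down much more than a simple average would.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.6, 0.8), 4))   # balanced precision and recall
print(round(f1_score(1.0, 0.1), 4))   # perfect precision, very poor recall
print((1.0 + 0.1) / 2)                # simple average of the same two values, for comparison
```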
9. Left skewed distribution
A left skewed distribution is a distribution whose curve has a long tail towards the left. In this type of distribution, the relationship between mean, median and mode is typically mean < median < mode. Let’s see what a left skewed distribution graph looks like
Example: Mortality by age group. Generally, the average person lives 70-80 years, so most deaths occur at older ages. The number of deaths at a young age is much smaller, which forms the long left tail. Very few people live more than 100 years.
10. Right skewed distribution
A right skewed distribution is a distribution whose curve has a long tail towards the right. In this type of distribution, the relationship between mean, median and mode is typically mean > median > mode. Let’s see what a right skewed distribution graph looks like
Example: Income of people in the USA.
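The ordering of mean, median and mode for the two skewed distributions can be checked on a small made-up sample using Python's `statistics` module. The numbers are hypothetical. The left skewed sample is simply the mirror image of the right skewed one.

```python
from statistics import mean, median, mode

# Hypothetical right skewed sample: most values are small, one is very large
right_skewed = [1, 1, 1, 2, 2, 3, 4, 10]
# Mirror image of the sample above, which gives a long tail on the left
left_skewed = [11 - value for value in right_skewed]

for name, data in [("right skewed", right_skewed), ("left skewed", left_skewed)]:
    print(f"{name}: mean={mean(data)}, median={median(data)}, mode={mode(data)}")
```

For the right skewed sample the large value drags the mean above the median and the mode, and the mirrored sample shows the opposite order.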
11. Mean Absolute Error (MAE)
Mean Absolute Error (MAE) takes the absolute difference between each actual value and predicted value and computes the average. It only measures the distance between the actual and predicted values, not the direction (actual > predicted or predicted > actual). The formula for calculating MAE is
MAE = (1/n) × Σ |y_i − ŷ_i|
where y_i is the actual value, ŷ_i is the predicted value and n is the number of samples.
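A minimal sketch of the MAE calculation on made-up values. The function name `mean_absolute_error` is chosen here for illustration. Swapping the two lists gives the same result, which shows that the direction of the error is ignored.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values
actual = [3, 5, 7]
predicted = [2, 5, 10]

print(mean_absolute_error(actual, predicted))   # absolute errors are 1, 0 and 3
print(mean_absolute_error(predicted, actual))   # same result: direction does not matter
```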
12. Mean Squared Error (MSE)
Mean Squared Error (MSE) takes the square of the difference between each actual and predicted value and computes the average. Since the error is squared, a single outlier in the dataset can make the total error very high. So, this metric is sensitive to outliers. The formula for calculating MSE is
MSE = (1/n) × Σ (y_i − ŷ_i)²
where y_i is the actual value, ŷ_i is the predicted value and n is the number of samples.
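The effect of an outlier on MSE can be seen in a small sketch. The values are made up, and the function name `mean_squared_error` is chosen for illustration. The two sets of actual values differ only in the last entry.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values
predicted = [11, 11, 15, 15]
actual_clean = [10, 12, 14, 16]          # every prediction is off by 1
actual_with_outlier = [10, 12, 14, 36]   # last actual value is an outlier

print(mean_squared_error(actual_clean, predicted))
print(mean_squared_error(actual_with_outlier, predicted))
```

One error of 21 contributes 441 to the sum, so it dominates the three errors of 1.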
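13. Root Mean Squared Error (RMSE)
Root Mean Squared Error (RMSE) is the square root of the average of the squared differences between actual and predicted values, that is, the square root of MSE. Taking the square root brings the error back to the same units as the target variable, which makes it easier to interpret than MSE. However, the errors are still squared before averaging, so RMSE remains sensitive to outliers and is not robust to them. The formula for calculating RMSE is
RMSE = √( (1/n) × Σ (y_i − ŷ_i)² )
where y_i is the actual value, ŷ_i is the predicted value and n is the number of samples.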
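A minimal sketch of RMSE as the square root of MSE, using made-up values. The function name `root_mean_squared_error` is chosen for illustration. The two sets of actual values differ only in the last entry, so we can see how RMSE responds to a single large error.

```python
import math

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference (the square root of MSE)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Hypothetical values
predicted = [11, 11, 15, 15]
actual_clean = [10, 12, 14, 16]          # every prediction is off by 1
actual_with_outlier = [10, 12, 14, 36]   # last actual value is an outlier

print(root_mean_squared_error(actual_clean, predicted))
print(round(root_mean_squared_error(actual_with_outlier, predicted), 3))
```

The result is in the same units as the target values, but the single large error still raises it from 1 to more than 10.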
Hence, these are some of the very important topics in machine learning that one should know.
For more detailed information about performance metrics in regression analysis, follow this link
Happy Learning 🙂