Regression vs Classification in Machine Learning
- January 25, 2020
- Posted by: admin
- Category: Machine Learning
What is Machine Learning?
Machine Learning is a subset of Artificial Intelligence that lets machines discover patterns in data, draw insights and support decision making.
A mathematical model is built and trained on sample data. The different types of learning include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning and self-learning.
In supervised learning, the prediction model is built from a set of labelled training data (i.e. data that contains both the inputs and the expected outputs). Regression and classification are the two main types of supervised learning.
What is Regression in Machine Learning?
Francis Galton coined the term “Regression” in the context of a biological phenomenon. The work was later extended to a general statistical context by Karl Pearson and Udny Yule. Regression analysis is a statistical method that estimates the relationship between a dependent variable and one or more independent variables and determines the strength of that relationship. It also indicates the impact of each independent variable on the dependent variable.
More generally, regression refers to any procedure that tries to find the relationship between variables.
When is Regression used?
Regression is used when we want to predict an output variable that is continuous, i.e. a real value. Examples:
- Predicting the Price of House
- Predicting the Height of person
- Predicting the salary of person
- Predicting Temperature
There are different regression modelling techniques. Among them, Linear Regression is the most popular algorithm and the simplest form of regression.
Linear Regression is represented by y = ax + b + e, where a is the slope of the line, b is the intercept and e is the error.
In the diagram below, the blue dots are the observed data points and the red line is the line of best fit.
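To make the equation y = ax + b + e concrete, here is a minimal sketch of fitting a line by ordinary least squares in plain Python. The function name `fit_line` and the house size and price figures are made up for illustration; they do not come from a real data set.

```python
# Ordinary least squares for a single feature: y = a*x + b + e.

def fit_line(xs, ys):
    """Return the slope a and intercept b that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical house sizes (in 100 sq ft) and prices (in thousands)
sizes = [10, 15, 20, 25, 30]
prices = [110, 155, 210, 245, 305]

a, b = fit_line(sizes, prices)

# The error term e is whatever the line does not explain for each point
residuals = [y - (a * x + b) for x, y in zip(sizes, prices)]

print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
print(f"predicted price for size 22: {a * 22 + b:.1f}")
```

With a least-squares fit that includes an intercept, the residuals always sum to (approximately) zero, which is a quick sanity check on the result.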
There are many different types of regression algorithm, such as Linear Regression, Polynomial Regression, Lasso Regression, Ordinal Regression, Quantile Regression, ElasticNet Regression, Stepwise Regression, Poisson Regression and Cox Regression.
In multiple regression there is more than one independent variable.
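As a rough illustration of multiple regression, the sketch below fits two independent variables by solving the normal equations with a small Gaussian elimination routine. Everything here is an assumption made for the example: the helper names `solve` and `fit_multiple`, and the salary data, which is generated from a known formula so the recovered coefficients can be checked. In practice a numerical library would normally do this work.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into the pivot position
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_multiple(X, y):
    """Return least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 gives the intercept b0
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

# Made-up data generated exactly from: salary = 20 + 3*years + 5*has_degree
X = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0)]
y = [20 + 3 * years + 5 * degree for years, degree in X]

coef = fit_multiple(X, y)
print([round(c, 6) for c in coef])  # intercept, years coefficient, degree coefficient
```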
What is Classification in Machine Learning?
A classification algorithm is used when we want to predict an output variable that is discrete. The dependent variable is predicted by analysing the independent variables.
The main goal of classification is to identify the category of the dependent variable based on the training data. Examples:
- Classification of fruits (by analysing properties such as colour, size and texture)
- Classification of animals (from input images)
- Email spam identification
Some common classification algorithms are:
- Logistic Regression (Linear Classifier)
- Naïve Bayes
- Nearest Neighbour
- Support Vector Machine
- Decision Tree
- Boosted Trees
- Random Forest
- Neural Networks etc.
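To show what one of these algorithms looks like in its simplest form, here is a sketch of a 1-nearest-neighbour classifier applied to the fruit example. The measurements and the function name `predict_nearest` are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical labelled examples: (weight in grams, diameter in cm) -> fruit
training_data = [
    ((150.0, 7.0), "apple"),
    ((165.0, 7.5), "apple"),
    ((5.0, 1.5), "grape"),
    ((6.0, 1.8), "grape"),
    ((1100.0, 14.0), "melon"),
    ((1250.0, 15.0), "melon"),
]

def predict_nearest(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda item: math.dist(item[0], features))
    return nearest[1]

print(predict_nearest((158.0, 7.2)))  # closest to the apple examples
print(predict_nearest((5.5, 1.6)))    # closest to the grape examples
```

Note that the output is a category label rather than a number. Real uses of nearest neighbour usually scale the features first so that one feature (here, weight) does not dominate the distance.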
Logistic Regression estimates the probability of a certain class or event. Using the logit function, it models the probability that the event occurs. Suppose we want to find out whether a person is diabetic or not based on their age, blood pressure (BP) and sex.
More formally it can be written as
In this example, diabetes is the dependent variable and age, BP and sex are the independent variables.
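The sketch below shows how a logistic regression model could turn these three inputs into a probability, assuming the standard logistic form in which the log-odds are a linear combination of the inputs. The coefficient values and the function name `diabetes_probability` are invented for illustration; a real model learns its coefficients from training data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, NOT fitted to any real data
B0, B_AGE, B_BP, B_SEX = -8.0, 0.05, 0.04, 0.3

def diabetes_probability(age, bp, sex):
    """Probability of diabetes; sex is encoded as 0 or 1."""
    # z is the log-odds (logit) of the event
    z = B0 + B_AGE * age + B_BP * bp + B_SEX * sex
    return sigmoid(z)

p = diabetes_probability(age=60, bp=130, sex=1)

# A threshold (0.5 here) converts the probability into a class label
label = "diabetic" if p >= 0.5 else "not diabetic"
print(f"probability = {p:.3f}, predicted class = {label}")
```

The threshold of 0.5 is a common default, but it is a design choice; a lower threshold catches more true cases at the cost of more false alarms.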
A problem where the outcome has two classes is known as a binary classification problem.
A problem where the outcome has more than two classes is known as a multi-class classification problem.
A problem where a data point can be assigned multiple labels is known as a multi-label classification problem.
The types of logistic regression are:
- Binary Logistic Regression: the outcome has two classes. E.g.: spam or not spam
- Multinomial Logistic Regression: more than two outcome classes without any order. E.g.: shape (rectangle, round, triangle)
- Ordinal Logistic Regression: more than two outcome classes with an ordering. E.g.: grades (Distinction, First class, Second class)
To select the right algorithm for modelling, it is very important to understand whether the problem is a classification problem or a regression problem.
Performance Evaluation of Classification and Regression:
It is very important to evaluate the performance of a model. Both classification and regression have various methods, formulas and techniques for evaluating the performance of an algorithm.
Performance metrics for Regression:
The different metrics for Regression problems are:
- Mean Absolute Error (MAE): the average absolute difference between the predicted values and the actual values.
- Root Mean Squared Error (RMSE): the square root of the average squared difference between the predicted values and the observed values. It measures the spread of the residuals.
- R-squared: also known as the coefficient of determination, it gives the proportion of the variance in the dependent variable that is explained by the model.
- Adjusted R-squared: a version of R-squared adjusted for the number of independent variables. It increases only when a new variable improves the model more than would be expected by chance.
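These four metrics can be computed in a few lines. The following is a sketch using the standard formulas; the function name `regression_metrics` and the sample values are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, RMSE, R-squared and adjusted R-squared as a dict."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot

    # Penalise R-squared for the number of independent variables used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Hypothetical observed values and model predictions
y_true = [3, 5, 7, 9, 11]
y_pred = [2, 5, 8, 9, 12]

metrics = regression_metrics(y_true, y_pred, n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this data the errors are small relative to the spread of the observed values, so R-squared is close to 1, and the adjusted value is slightly lower because of the penalty for the one feature used.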
Performance Metrics for Classification:
Different metrics are used to measure performance, and they are all computed from four counts: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). These counts are usually arranged in a confusion matrix, which can be computed in Python with the confusion_matrix function from the scikit-learn library.
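The four counts can also be tallied by hand, which makes clear what each one means. This is a dependency-free sketch for a binary problem; the helper name `confusion_counts` and the label lists are hypothetical. In a real project, scikit-learn's `sklearn.metrics.confusion_matrix` would normally be used instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted positive, actually positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative
        elif actual == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical actual labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```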
Based on this information we calculate:
Accuracy: the fraction of predictions our model got correct.
Accuracy = Correct Predictions / Total Number of Predictions = (TP + TN) / (TP + TN + FP + FN)
Precision: the ratio of correctly predicted positive observations to the total predicted positive observations.
Precision = TP / (TP + FP)
Recall: the ratio of correctly predicted positive observations to all actual positive observations. Recall is also known as sensitivity.
Recall = TP / (TP + FN)
Specificity: the True Negative Rate, i.e. the ratio of correctly predicted negative observations to all actual negative observations.
Specificity = TN / (TN + FP)
F1 score: the harmonic mean of Precision and Recall.
F1 = 2 × Precision × Recall / (Precision + Recall)
ROC/AUC curve: the ROC curve shows the performance of the model across different classification thresholds by plotting the True Positive Rate against the False Positive Rate. AUC is the area under the ROC curve.
Log loss: it measures the performance of a model whose output is a probability between 0 and 1, penalising confident predictions that turn out to be wrong.
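As a worked example, the sketch below computes the count-based metrics and the log loss in plain Python. The counts, labels and probabilities are invented for illustration, and the function names are assumptions made for this example.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the count-based metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def log_loss(y_true, probs):
    """Average negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        # p is the predicted probability of class 1
        total += math.log(p) if y == 1 else math.log(1 - p)
    return -total / len(y_true)

# Hypothetical counts from a test set of 100 examples
scores = classification_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")

# Hypothetical true labels and predicted probabilities of the positive class
labels = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.3]
poor_probs = [0.6, 0.4, 0.5, 0.5]
print(f"log loss (confident, correct): {log_loss(labels, good_probs):.4f}")
print(f"log loss (hesitant): {log_loss(labels, poor_probs):.4f}")
```

Here accuracy is 0.7 while recall is only about 0.667, which shows why a single metric rarely tells the whole story. This `log_loss` sketch assumes probabilities strictly between 0 and 1; values of exactly 0 or 1 would need clipping first.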
Classification Vs Regression
| | Classification | Regression |
|---|---|---|
| Prediction | The output variable is discrete in nature | The output variable is continuous in nature |
| Find | Decision boundary | Best-fit line |
| Evaluation | Calculate accuracy | Calculate sum of squared errors, R-squared |
| Example Algorithms | Logistic Regression, Decision Tree, Random Forest etc. | Linear Regression, Polynomial Regression etc. |
To choose the best model for your specific use case, it is important to understand the difference between classification and regression problems, since the type of problem determines the parameters on which we train and tune our model.