Regression vs Classification in Machine Learning

What is Machine Learning?

Machine Learning is a subset of Artificial Intelligence that lets a machine discover patterns in data, draw insights, and support decision making.

A mathematical model is built and trained on sample data. The different types of learning include Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, Reinforcement Learning, Self-Learning, etc.

In Supervised Learning, the prediction model is built from a set of labeled training data (i.e. the data contains both the inputs and the expected outputs). Regression algorithms and Classification algorithms are the two main types of supervised learning.

What is Regression in Machine Learning?

Francis Galton coined the term “Regression” in the context of a biological phenomenon. The work was later extended to a general statistical context by Karl Pearson and Udny Yule. Regression analysis is a statistical method that derives the relationship, and determines its strength, between a dependent variable and one or more independent variables. It also indicates the impact of the independent variables on the dependent variable.

The meaning of Regression is any procedure that tries to find the relationship between variables.

When is Regression used?

Regression is used when we want to predict an output variable that is continuous, i.e. a real value.

For example:

  • Predicting the price of a house
  • Predicting the height of a person
  • Predicting the salary of a person
  • Predicting temperature

There are different Regression modeling techniques. Among them, Linear Regression is the most popular algorithm and the simplest form of Regression.

Linear Regression is represented by y = ax + b + e, where a is the slope of the line, b is the intercept of the line, and e is the error term.
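As a rough illustration of how a and b can be estimated, the sketch below fits y = ax + b by ordinary least squares in plain Python. The function name fit_line and the house data are made up for this example. In practice a library such as scikit-learn would normally do this work.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Made-up data: house size vs. price (illustrative units only).
sizes = [5, 10, 15, 20, 25]
prices = [12, 19, 33, 41, 50]

a, b = fit_line(sizes, prices)

# The residuals play the role of the error term e for each observed point.
residuals = [y - (a * x + b) for x, y in zip(sizes, prices)]

print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
```

The residuals of a least-squares line with an intercept sum to zero, which gives a quick sanity check on the fit.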

In the diagram below blue dots are the observed data points and the red line is the line of best fit.

 

There are many different types of Regression algorithms, such as Linear Regression, Polynomial Regression, Lasso Regression, Ordinal Regression, Quantile Regression, ElasticNet Regression, Stepwise Regression, Poisson Regression, Cox Regression, etc.

In multiple regression, there is more than one independent variable.

What is Classification in Machine Learning?

A Classification algorithm is used when we want to predict an output variable that is discrete. The dependent variable is predicted by analyzing the independent variables.

The main goal of classification is to identify the category of the dependent variable based on the training data.

For example:

  • Classification of fruits (by analyzing the properties – color, size, texture, etc.)
  • Classification of animals (input images)
  • Face recognition
  • Email spam identification
  • Sentiment analysis

The Different Classification Algorithms are:

  • Logistic Regression (Linear Classifier)
  • Naïve Bayes
  • Nearest Neighbour
  • Support Vector Machine
  • Decision Tree
  • Boosted Trees
  • Random Forest
  • Neural Networks etc.
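To make the idea concrete, here is a minimal sketch of the simplest algorithm in the list, a 1-nearest-neighbour classifier, written in plain Python. The fruit measurements and the function name nearest_neighbour are invented for illustration. Real projects usually scale the features first and use a library implementation.

```python
import math


def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`."""
    # Each training item is (features, label); compare by Euclidean distance.
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


# Made-up training data: (weight in grams, diameter in cm) -> fruit label.
training_data = [
    ((150, 7.0), "apple"),
    ((170, 7.5), "apple"),
    ((8, 2.0), "cherry"),
    ((10, 2.2), "cherry"),
]

print(nearest_neighbour(training_data, (12, 2.5)))
print(nearest_neighbour(training_data, (160, 7.2)))
```

The output is a discrete category rather than a number on a continuous scale, which is what separates this from a regression model.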


Logistic Regression estimates the probability of a certain class or event. It passes a linear combination of the inputs through the logistic (sigmoid) function, which is the inverse of the logit function, so the output always lies between 0 and 1. Suppose we want to find whether a person is diabetic or not based on their age, blood pressure (BP), and sex.

More formally, it can be written as

P(Diabetes = Yes | Age, BP, Sex)

In this example, Diabetes is the dependent variable and Age, BP, and Sex are the independent variables.
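The sketch below shows how such a probability could be computed once a model has been trained. The weights and bias are hypothetical numbers chosen only for illustration, not fitted to any real data, and the 0.5 cut-off is a common default rather than a rule.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(age, bp, sex, weights, bias):
    """Estimated probability of the positive class for one person."""
    z = bias + weights["age"] * age + weights["bp"] * bp + weights["sex"] * sex
    return sigmoid(z)


# Hypothetical coefficients (NOT learned from data), for illustration only.
weights = {"age": 0.04, "bp": 0.03, "sex": 0.2}
bias = -6.0

p = predict_probability(age=50, bp=130, sex=1, weights=weights, bias=bias)

# Turn the probability into a class label with a 0.5 threshold.
label = "diabetic" if p >= 0.5 else "not diabetic"
print(f"P(diabetic) = {p:.3f} -> {label}")
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities agree with the labeled data as closely as possible.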

A problem where the outcome is one of two classes is known as a binary classification problem.

A problem where the outcome is one of more than two classes is known as a multi-class classification problem.

A problem where a data point can be assigned multiple labels is known as a multi-label classification problem.

The types of Logistic Regression are:

  1. Binary Logistic Regression: The outcome is one of two classes. E.g.: spam or not spam
  2. Multinomial Logistic Regression: More than two outcome classes without any ordering. E.g.: shape – rectangle, round, triangle
  3. Ordinal Logistic Regression: More than two outcome classes with an ordering. E.g.: grades – Distinction, First class, Second class

To select the right algorithm for modeling it is very important to understand whether the problem is a classification problem or a Regression problem.

Performance Evaluation of Classification and Regression:

It is very important to evaluate the performance of the model. Both Classification and Regression have various methods, formulas, and techniques for evaluating the performance of an algorithm.

Performance metrics for Regression:

The different metrics for Regression problems are:

  1. Mean Absolute Error (MAE): the average absolute difference between the predicted values and the actual values.
  2. Root Mean Squared Error (RMSE): the square root of the average squared difference between the predicted values and the observed values. It measures the spread of the residuals.
  3. R-squared: also known as the coefficient of determination, it tells what proportion of the variance in the dependent variable is explained by the model.
  4. Adjusted R-squared: a version of R-squared that is adjusted for the number of independent variables. It increases only when a new variable improves the model more than would be expected by chance.
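The four regression metrics can be computed by hand in a few lines. The sketch below uses the standard textbook formulas in plain Python. The function name regression_metrics and the sample values are assumptions made for this example.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, RMSE, R-squared and adjusted R-squared as a dict."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

    # R-squared = 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    # Adjusted R-squared penalises the number of independent variables.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Made-up actual and predicted values from a model with one input variable.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 8.0, 12.0]

metrics = regression_metrics(actual, predicted, n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For MAE and RMSE, lower is better. For R-squared and adjusted R-squared, values closer to 1 indicate a better fit.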

Performance metrics for Classification:

Several metrics are used to measure classification performance. All of them are built from four counts: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These counts are usually arranged in a confusion matrix, which the scikit-learn library for Python can compute with its confusion_matrix function.
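To show where the four counts come from, the sketch below tallies them by hand for a binary problem. It is a plain-Python illustration with made-up labels. The helper name confusion_counts is hypothetical, and scikit-learn's sklearn.metrics.confusion_matrix provides the same information in matrix form.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```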

Based on this information we calculate:

Accuracy: the fraction of predictions our model got correct.

Accuracy = Correct Predictions / Total Number of Predictions = (TP + TN) / (TP + TN + FP + FN)

Precision: the ratio of correctly predicted positive observations to the total predicted positive observations.

Precision = TP / (TP + FP)

Recall: the ratio of correctly predicted positive observations to all actual positive observations. Recall is also known as sensitivity or the True Positive Rate.

Recall = TP / (TP + FN)

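Specificity: It measures the True Negative Rate, the ratio of correctly predicted negative observations to all actual negative observations.

Specificity = TN / (TN + FP)

F1 score: It is the harmonic mean of Precision and Recall.

F1 = 2 × Precision × Recall / (Precision + Recall)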
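The metrics so far can all be derived from the four confusion-matrix counts. The following sketch uses the standard formulas with made-up counts. The function name classification_metrics is an assumption, and the sketch does not guard against division by zero (for example, when the model predicts no positives).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity, true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for 100 predictions.
scores = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can be misleading when one class is much more common than the other, which is why precision, recall, and F1 are usually reported alongside it.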

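ROC/AUC curve: The ROC curve shows the performance of the model across all classification thresholds by plotting the True Positive Rate against the False Positive Rate. AUC is the area under the ROC curve; a value of 1.0 means the model ranks every positive above every negative, and 0.5 is no better than random guessing.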
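One way to build intuition for AUC is that it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition in plain Python. It is a simple pairwise version meant for small examples, with made-up labels and scores, and roc_auc is a name chosen for this illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half

    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities.
labels = [0, 0, 1, 1]
predicted_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, predicted_scores)
print(f"AUC = {auc:.2f}")
```

Because it depends only on how the examples are ranked, AUC does not change if a different classification threshold is chosen.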

Log loss: It measures the performance of a model whose output is a probability value between 0 and 1. It penalizes confident wrong predictions heavily, and lower values are better.
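As a small illustration, the sketch below computes binary log loss (also called binary cross-entropy) in plain Python. The sample probabilities are made up, and the eps clipping value is an assumption used only to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average binary log loss; `probs` are predicted P(class = 1)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        # Clip the probability so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# The same two examples, scored by a confident-and-right model
# and by a confident-and-wrong model.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])

print(f"confident and right: {confident_right:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
```

The second model is punished far more heavily, since log loss grows quickly as a confident prediction moves away from the true label.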

Classification Vs Regression

PARAMETER           | CLASSIFICATION                                          | REGRESSION
Prediction          | The output variable is discrete in nature               | The output variable is continuous in nature
Find                | Decision boundary                                       | Best-fit line
Output data         | Unordered                                               | Ordered
Evaluation          | Calculate accuracy                                      | Calculate the sum of squared errors, R-squared
Example algorithms  | Logistic Regression, Decision Tree, Random Forest, etc. | Linear Regression, Polynomial Regression, etc.

To choose the best model for your specific use case, it is really important to understand the difference between Classification and Regression problems, because the problem type determines how we train, evaluate, and tune our model.
