Linear regression

Parmita Biswas
Published in Analytics Vidhya
4 min read · Sep 19, 2020

Photo by Clay Banks on Unsplash

Linear regression is one of the basic topics for anyone who wants to learn statistics. When you start speaking with numbers, they answer in terms of values. We need to talk to numbers in the way they understand. The more you speak with numbers, the more information they give you.

In this article we will learn what linear regression is, along with some basic terms and a few graphs.

A few terms:

Independent and dependent variables:

The independent variable is the variable that the experimenter changes or controls, and it is assumed to have a direct effect on the dependent variable. Two examples of common independent variables are gender and educational level.

The dependent variable is the variable being tested and measured in an experiment and is ‘dependent’ on the independent variable. An example of a dependent variable is depression symptoms, which depends on the independent variable (type of therapy).

Independent and Dependent Variables

Prerequisites for Regression

· The dependent variable Y has a linear relationship to the independent variable X. (We can check this with a simple scatter plot.)

· For each value of X, the probability distribution of Y has the same standard deviation σ.

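The Least Squares Regression Line

Linear regression finds the straight line that best fits the data. This is called the least squares regression line, because it minimizes the sum of the squared vertical distances between the observed points and the line. The regression line is:

Y = mX + c

c = constant (the intercept, the value of Y when X is 0)

m = slope (in the equation of a line; in regression it is called the regression coefficient)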

X = independent variable

Y = dependent variable
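As a rough illustration of how m and c can be obtained, here is a minimal sketch in plain Python. The function name fit_line and the hours/marks values are made up for this example and do not come from the graphs in this article. It uses the standard least squares formulas: the slope is the co-variation of X and Y divided by the variation of X, and the intercept makes the line pass through the point of means.

```python
def fit_line(xs, ys):
    """Return (m, c) of the least squares line Y = m*X + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx  # assumes the X values are not all identical
    # Intercept: the fitted line passes through (x_mean, y_mean).
    c = y_mean - m * x_mean
    return m, c


# Hypothetical data: hours studied per day and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [35, 45, 50, 65, 70]

m, c = fit_line(hours, marks)
print(f"marks = {m:.2f} * hours + {c:.2f}")  # marks = 9.00 * hours + 26.00
```

With this made-up data, each extra hour of study is associated with 9 more marks, and the line predicts 26 marks at zero hours.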

Let us study a quick example:

Perfect Linear relationship

In the above graph, we see the relationship between a student's marks and the hours he spends studying each day. The relationship is perfectly linear. The graph also shows one more term, R square, which here has the value 1.

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is the percentage of the response variable variation that is explained by a linear model.

R-squared = Explained variation / Total variation

For a least squares line fitted with a constant term, R-squared is always between 0 and 100%.

0% indicates that the model explains none of the variability of the response data around its mean.

100% indicates that the model explains all the variability of the response data around its mean.
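The ratio "explained variation / total variation" can be sketched in a few lines of plain Python. This is only an illustration: the function name r_squared and both data sets are hypothetical, and the line is fitted with the usual least squares formulas.

```python
def r_squared(xs, ys):
    """R-squared of the least squares line fitted to (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Fit Y = m*X + c by least squares.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    c = y_mean - m * x_mean
    predicted = [m * x + c for x in xs]
    # Explained variation: how far the predictions move away from the mean.
    explained = sum((p - y_mean) ** 2 for p in predicted)
    # Total variation: how far the observed values move away from the mean.
    total = sum((y - y_mean) ** 2 for y in ys)  # assumes Y is not constant
    return explained / total


# Hypothetical data: hours studied per day and marks obtained.
hours = [1, 2, 3, 4, 5]
perfect_marks = [20, 40, 60, 80, 100]  # lies exactly on a straight line
noisy_marks = [35, 45, 50, 65, 70]     # close to a line, but not exact

r2_perfect = r_squared(hours, perfect_marks)
r2_noisy = r_squared(hours, noisy_marks)
print(f"perfect: {r2_perfect:.2f}, noisy: {r2_noisy:.2f}")
```

The first data set gives an R-squared of 1 (100%), while the second gives a value a little below 1, because some of the variation in marks is not captured by the line.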

Let's see one more example.

Linear equation — R square <1

In the above graph, we see that the relationship is not perfectly linear. We have drawn the straight line that fits the points as well as possible. We also see the R square to be 0.95 (95%). This means that the model explains 95% of the variability of the response data around the mean. In other words, 95% of the variance in Y (marks) is predictable from X (hours).

Now let us see one more example.

Linear equation R square << 1

In the above example, we see that it is very hard to draw a straight line that passes close to all the data points. The R square value is 0.04 (4%), which is too low to say that the model fits well. Only 4% of the variance in Y (marks) is predictable from X (hours).

Hopefully the above examples make the interpretation of R square clear.

Let us solve a simple problem statement:

Question: A student uses a regression equation to predict marks, based on hours of study. The correlation between predicted marks and time spent is 0.60. What is the correct interpretation of this finding?

Answer: The coefficient of determination measures the proportion of variation in the dependent variable that is predictable from the independent variable. With a single independent variable, the coefficient of determination (R square) is equal to the square of the correlation; in this case, (0.60) squared, or 0.36. Therefore, 36% of the variability in marks can be explained by time spent.
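The arithmetic in this answer can be checked with a short Python sketch. The helper pearson_r and the sample data are hypothetical and only there to show that squaring a correlation gives the proportion of variance explained when there is a single independent variable.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# The problem statement: a correlation of 0.60.
r = 0.60
r2 = r ** 2
print(f"R square = {r2:.2f}")  # R square = 0.36

# Hypothetical data: the same relation holds for a computed correlation.
hours = [1, 2, 3, 4, 5]
marks = [35, 45, 50, 65, 70]
r_sample = pearson_r(hours, marks)
print(f"r = {r_sample:.3f}, r squared = {r_sample ** 2:.3f}")
```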

Congratulations, you did it.

For now, thank you all for making it this far. We have started with linear regression and its interpretability. We will dive deeper so that we are able to translate a mathematics problem into simple layman's language. One of the most important aspects of any problem statement is how well you can explain it to the business. For this we always need interpretability, and explanations given in a way that can be easily understood.

If you missed the previous articles on statistics, you can find them here.

And as always, if you have any questions, remarks, or comments, feel free to contact me!

Reference:

Statistics How To

Parmita Biswas
Analytics Vidhya

I am an enthusiastic data scientist as well as a Python developer. I have ten years of overall industry experience.