STATISTICS FOR DATA SCIENCE AND MACHINE LEARNING

Pearson Correlation Coefficient

The relation between continuous variables in Statistics

Adrià Serra
5 min read · Aug 6, 2020


In the last post, we analyzed relationships involving categorical variables: categorical with categorical, and categorical with continuous. This time, we will analyze the relation between two ratio-level, or continuous, variables.

Pearson’s correlation, sometimes just called correlation, is the most widely used metric for this purpose: it measures the strength and direction of the linear relationship between two variables.

Analyzing correlations is one of the first steps in any statistics, data analysis, or machine learning process. It allows data scientists to detect patterns early and anticipate the possible outcomes of machine learning algorithms, guiding us toward better models.

Correlation is a measure of the relation between variables, but it cannot prove causality between them.

Some examples of spurious correlations that exist in the world can be found on this website.

This example is taken from https://tylervigen.com/spurious-correlations.

In the case of the last graph, it’s clearly not true that one of these variables implies the other, even with a correlation of 99.79%.

Scatterplots

To take a first look at our dataset, a good way to start is to plot pairs of continuous variables, one on each axis. Each point on the graph corresponds to a row of the dataset.

Scatterplots give us a sense of the overall relationship between two variables:

  • Direction: is the relation positive or negative? When one variable increases, does the second one increase or decrease?
  • Strength: how much does one variable change when the second one increases?
  • Shape: is the relation linear, quadratic, exponential…?

Scatterplots are also a fast technique for detecting outliers: if a value is widely separated from the rest, checking the values for that individual will be useful.

We will work with the most used dataset in machine learning, Iris, which contains measurements of iris flowers; the objective is to classify each flower into one of three species (setosa, versicolor, virginica).
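As a quick sketch (assuming scikit-learn is installed; the column and species names below come from its bundled copy of Iris), the dataset can be loaded like this:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 flowers, 4 continuous measurements each
iris = load_iris(as_frame=True)
df = iris.frame  # features plus an integer "target" column (0, 1, 2)

# Add a readable species label for plotting and grouping
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)
print(sorted(df["species"].unique()))
```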

Scatter plot of two iris dataset variables, self-generated.

The objective of the Iris dataset is to classify the distinct types of iris with the data we have. To find the best approach to this problem, we want to analyze all the available variables and their relations.

The last plot shows the petal length and width variables, with the distinct iris classes separated by color. From this plot we can extract the following:

  • There’s a positive linear relationship between both variables.
  • Petal length increases approximately 3 times faster than the petal width.
  • Using these two variables, the groups are visually distinguishable.
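A plot like the one described above can be reproduced with matplotlib; this is a minimal sketch assuming scikit-learn’s copy of the Iris data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

fig, ax = plt.subplots()
# One scatter call per species so each class gets its own color and legend entry
for target, name in enumerate(iris.target_names):
    subset = df[df["target"] == target]
    ax.scatter(subset["petal width (cm)"], subset["petal length (cm)"], label=name)
ax.set_xlabel("petal width (cm)")
ax.set_ylabel("petal length (cm)")
ax.legend()
fig.savefig("iris_scatter.png")
```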

Scatter Plot Matrix

To plot all relations at the same time on a single figure, the best approach is a pair plot: a matrix of all variables containing every possible scatterplot.

As you can see, the plot of the last section is in the last row and third column of this matrix.

Pair plot of the iris dataset variables, self-generated.

In this matrix, the diagonal can show distinct plots; in this case, we used the distributions of each variable for each of the iris classes.

Being a matrix, we have two plots for each combination of variables: the plot at (row, column) always reappears with its axes swapped at (column, row), on the other side of the diagonal.

Using this matrix, we can easily obtain an overview of all the continuous variables in the dataset.
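seaborn’s `pairplot` is the usual one-liner for this kind of figure; a dependency-lighter sketch uses pandas’ `scatter_matrix` (here with plain histograms, rather than per-class distributions, on the diagonal):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# iris.data is a DataFrame with the four continuous columns
iris = load_iris(as_frame=True)

# 4x4 grid: a scatterplot for every ordered pair of variables,
# with each variable's histogram on the diagonal
axes = scatter_matrix(iris.data, diagonal="hist", figsize=(8, 8))
plt.savefig("iris_pair_plot.png")
```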

Pearson Correlation Coefficient

Scatter plots are an important tool for analyzing relations, but we also need to check whether the relation between variables is significant. To measure the linear correlation between variables, we can use Pearson’s r, the Pearson correlation coefficient.

The range of possible values of this coefficient is [-1, 1], where:

  • 0 indicates no correlation.
  • 1 indicates a perfect positive correlation.
  • -1 indicates a perfect negative correlation.
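These extremes are easy to verify on toy data with NumPy (a small illustrative sketch, not from the original post):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]                          # perfect positive linear relation
r_neg = np.corrcoef(x, -3 * x)[0, 1]                             # perfect negative linear relation
r_zero = np.corrcoef(x, np.array([1.0, -1.0, -1.0, 1.0]))[0, 1]  # no linear relation
print(r_pos, r_neg, r_zero)
```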

To calculate this statistic we use the following formula:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

Pearson’s correlation formula, self-generated.
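The formula translates directly into NumPy; as a sketch (using the petal variables from the scikit-learn copy of Iris), we can compute r by hand and cross-check it against `np.corrcoef`:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
x = iris.frame["petal width (cm)"].to_numpy()
y = iris.frame["petal length (cm)"].to_numpy()

# Pearson's r from the definition: covariance of x and y
# divided by the product of their standard deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Cross-check against NumPy's built-in implementation
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both ≈ 0.96 for these two variables
```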

Test significance of correlation coefficient

We need to check whether the correlation is significant for our data by means of a hypothesis test, which we have already talked about. In this case:

  • H0: the variables are unrelated, r = 0
  • Ha: the variables are related, r ≠ 0

This statistic follows a Student’s t distribution with (n − 2) degrees of freedom, where n is the number of observations.

The formula for the t value is the following, and we compare the result with the critical value from the Student’s t table:

t = r · √(n − 2) / √(1 − r²)

Pearson’s correlation t-statistic formula, self-generated.

If our result is larger in absolute value than the table value, we reject the null hypothesis and conclude that the variables are related.
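As a sketch of the whole test (assuming SciPy is available), we can compute the t statistic and its p-value by hand and cross-check them against `scipy.stats.pearsonr`, which runs the same test:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
x = iris.frame["petal width (cm)"].to_numpy()
y = iris.frame["petal length (cm)"].to_numpy()
n = len(x)

r = np.corrcoef(x, y)[0, 1]
# t statistic for H0: r = 0, following a Student's t with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# SciPy runs the same test directly
r_scipy, p_scipy = stats.pearsonr(x, y)
print(t_stat, p_value, p_scipy)
```

Here the p-value is far below 0.05, so we reject H0: the petal variables are related.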

Coefficient of determination

To calculate how much the variation of one variable explains the variation of the other, we can use the coefficient of determination, calculated as the square of Pearson’s r (r²). This measure will be very important in regression models.
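As a quick numerical check (a sketch on the Iris petal variables), r² matches the R² of a simple least-squares fit:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
x = iris.frame["petal width (cm)"].to_numpy()
y = iris.frame["petal length (cm)"].to_numpy()

r = np.corrcoef(x, y)[0, 1]
r_squared = r**2  # share of y's variance explained by a linear fit on x

# Cross-check: in simple linear regression, the R² of the
# least-squares line equals the squared correlation coefficient
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r2_regression = 1 - residuals.var() / y.var()
print(r_squared, r2_regression)
```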

Summary

In the last post, we talked about correlation for categorical data and mentioned that correlation for continuous variables is easier. In this post, we explained how to perform this correlation analysis and how to check whether it is statistically significant.

Adding a significance test to the typical correlation analysis gives a better understanding of how to use each variable.

This is the eleventh post of my particular #100daysofML; I will be publishing the advances of this challenge on GitHub, Twitter, and Medium (Adrià Serra).

https://twitter.com/CrunchyML

https://github.com/CrunchyPistacho/100DaysOfML
