Finding Correlation in Data

What is Correlation?

Correlation refers to an association or relationship between variables. It is used to express one variable in terms of its relationship with others. For example, a company’s sales might increase when the marketing team spends more on online advertisements, or a customer’s average spending amount at a store might depend on a number of factors related to that customer, such as annual household income. Generally, correlation is the initial step to understanding these relationships and eventually building better statistical and business models.

What is Correlation Coefficient?

The Correlation Coefficient, a value between -1 and +1, tells us how strongly two variables are related to each other. A correlation coefficient of +1 indicates a perfect positive correlation, which means that as variable X increases, variable Y increases. Similarly, as variable X decreases, variable Y decreases.

A correlation coefficient of -1 indicates a perfect negative correlation, which means that as variable X increases, variable Y decreases. Similarly, as variable X decreases, variable Y increases.

A correlation coefficient of 0 or near 0 indicates no correlation i.e. no linear relationship between variables can be noticed.

Correlation Graphs

  (Source: https://www.mathsisfun.com/data/correlation.html)
  • The graphs represent linear correlations (correlations that follow a line).
  • Two datasets are said to have a high correlation when they are strongly linked together.
  • When the variables increase together, correlation is Positive.
  • When one variable decreases as the other increases, correlation is Negative.
  • The correlation value shows how strong the correlation is, and not how steep the line is, and if it is positive or negative.

Example Case: Ice Cream Sales Versus Temperature

Let’s consider the following data set of ice cream sales versus the temperature on that day.

We’ll use the same dataset for the examples and calculations that follow.

Correlation Data Set

How to Calculate Correlation

Correlation can be calculated in several ways. Let’s discuss the following three ways:

1. Pearson’s Correlation Coefficient

Pearson’s Correlation Coefficient (PCC, or Pearson’s r) is a widely used linear correlation measure. It’s often the first one taught in many elementary statistics courses. Mathematically speaking, it is defined as: “The covariance between two vectors, normalized by the product of their standard deviations.”

Pearson's Correlation Coefficient Formula

 

Let’s look at a simple way of finding the Pearson Correlation Coefficient. Label two sets of data – Temperature °C as “X” and Ice Cream Sales as “Y” (in Microsoft Excel):

  1. Find the mean of X and Y
  2. Subtract the mean of X from every X value (label them ‘“a”). Similarly, do the same for Y (label them “b”).
  3. Calculate ab, a² and for every value.
  4. Sum up ab, sum up and sum up .
  5. Divide the sum of ab by the square root of [(sum of a²) x (sum of b²)]

Correlation Coefficient

The Pearson Correlation Coefficient can also be calculated in SQL, Python and other programming languages too.

2. Analysis ToolPak in Excel

You can use the Analysis ToolPak add-in in Excel to quickly generate correlation coefficients between multiple variables.

NOTE: Analysis ToolPak does not come by default with Excel, see the video below to add it if you haven’t already.

  1. On the Data tab, in the Analysis group, click on Data Analysis.
    Data Analysis in Excel
  2. Select Correlation and click OK.
    Data Analysis Correlation
  3. Let’s use the same Ice cream sales vs. Temperature data set for this example. Select A1:B11 as the Input Range.
  4. Check Labels in first row.
  5. Select cell D2 as the Output Range.
  6. Click OK.
    Finding Correlation in Excel
  7. Result
    Correlation Analysis in ExcelConclusion – The two variables in this data are positively correlated (0.9843)

3. CORREL Function

The Excel Correl function calculates the Pearson Correlation Coefficient for two sets of values.

The syntax of the function is: CORREL(array1, array2)

Where array1 is a set of independent variables and array2 is a set of dependent variables. These arrays should be of equal length.

NOTE: CORREL function ignores text values and logical values that are supplied as part of an array.

CORREL function in Excel

Additionally, you can check how to use the PostgreSQL Correlation Function in Chartio.

Correlation is NOT Causation

“Correlation is NOT Causation” posits that a correlation does not mean that the two variables have a cause and effect relationship. There could be other reasons why a particular data set has a good correlation. The observed correlation could be due to the effects of a hidden third variable, or just entirely down to chance.

For example, let’s consider a new dataset (Ice cream sales and Sunglasses sold each day).

Correlation is not Causation

As you can see, the correlation between these two sets of data is highly positive. Does this mean that sunglasses make people want ice cream? Without further research, one can’t be sure why these two variables are correlated. In this case, further research revealed that a hidden third variable, namely temperature, caused this effect. You can see that while an increase in temperature causes both sunglasses and ice cream sales to go up, sunglasses and ice cream sales do not cause each other to go up. Each of them is independently caused to go up by the increase in temperature, thus explaining the high positive correlation. That said, correlation does allow for predictions about one variable to be made based upon another.

Ridhima Rao Donthineni

About Ridhima Rao Donthineni