Close
Data TutorialsPostgreSQL

The Pearson Correlation Coefficient Formula in SQL

Posted by Tim Miller

The Pearson Correlation Coefficient Formula in SQL

As we discussed on our “Using the PostgreSQL Correlation Function” page, using the correlation can show you how two series of numbers are related. Either their strength as a positive correlation or their strength as a negative correlation, and any amount of strength in between, including no correlation at all.

To recap a little bit on the Correlation Coefficient:

The Correlation Coefficient is a widely used method of determining the strength of the relationship between two numbers or two sets of numbers. This coefficient is calculated as a number between -1 and 1. 1 being the strongest possible positive correlation and -1 being the strongest possible negative correlation.

alt_text

A positive correlation means that as one number increases the second number will also increase.

A negative correlation means that as one number increases the second number decreases.

Whether or not the outcome of the second number is CAUSED by the first is not being determined here, just that the outcomes of the two numbers happen in concert with each other.

If the formula returns 0 then there is absolutely NO correlation between the two sets of numbers.

Pearson’s Formula looks like this:

As was outlined, there are ways to calculate the Pearson Formula for a set of numbers in SQL.

We did it here on a number set from out demo data source, that comes free with a trial.

That formula in SQL looks like this:

((tot_sum - (amt_sum * act_sum / _count)) / sqrt((amt_sum_sq - pow(amt_sum, 2.0) / _count) * (act_sum_sq - pow(act_sum, 2.0) / _count)))

And used in an entire query like this:

SELECT
    ((tot_sum - (amt_sum * act_sum / _count)) / sqrt((amt_sum_sq - pow(amt_sum, 2.0) / _count) * (act_sum_sq - pow(act_sum, 2.0) / _count))) AS "Corr Coef Using Pearson"


FROM(
SELECT
    sum("Amount") AS amt_sum,
    sum("Activities") AS act_sum,
    sum("Amount" * "Amount") AS amt_sum_sq,
    sum("Activities" * "Activities") AS act_sum_sq,
    sum("Amount" * "Activities") AS tot_sum,
    count(*) as _count

FROM(
SELECT
    DATE_TRUNC('day', p.payment_date)::DATE AS "Day",
    SUM(p.amount) AS "Amount",
    COUNT(DISTINCT a.activity_id) AS "Activities"
FROM
    public.payments p
    INNER JOIN public.subscriptions s ON p.subscription_id = s.subscription_id
    INNER JOIN public.users u ON s.user_id = u.user_id
    INNER JOIN public.activity a ON a.user_id = u.user_id

GROUP BY 1) as a

) as b

GROUP BY tot_sum, amt_sum, act_sum, _count, amt_sum_sq, act_sum_sq

What we would see from this query is exactly what we would see using the corr() function in PostgreSQL: