5

min read

A fast way to identify outliers is to sort the relevant values in both ascending and descending order. This allows you to quickly skim through the highest and lowest values. If you have a sense of what you are expecting from your data, this can help you quickly identify any unexpected values.

If we were looking at the ages of friends, we’d expect to see values that fall into the range of ages for adults. Let’s take a look at what we see in the example below.

To sort our values, we could use the following query.

CODE:https://gist.github.com/fronofro/44f9be22b528ef6f20135b9b8a75857a.js

Here’s what a sample query could look like.

By sorting we can easily identify anomalies within the data. We don’t expect any of our friends to have an age of 1, so this outlier is likely an error.

Because this is a small dataset, we can also view the final lines in our results. An age of 999 is not a possible value for the age of our friends, and we can see that we have a missing value.

If we had a larger dataset but wanted to quickly view the end of our data, we could do so by adding DESC to our ORDER BY statement.

CODE:https://gist.github.com/fronofro/3279e0dafd695f9876b565c974601f63.js

This method can be great for identifying obvious outliers that quickly stand out. However, it may miss some other details in the data. For example, if we have less knowledge of the expected values in our dataset, there may be enough outliers that they don’t stand out with just a quick look at the high and low values.

It’s often helpful to be able to use a more statistical method for identifying outliers.

One statistical method of identifying outliers is through the use of the interquartile range, or IQR. When we find values that fall outside of 1.5 times the range between our first and third quartiles, we typically consider these to be outliers.

SQL has a function that allows us to easily separate our values into our four quartiles. When we use NTILES() we separate our data into the same number of groups as the value inside the brackets. Therefore, if we want to separate our data into quartiles we would use NTILE(4).

NTILE is used in conjunction with a window function. If we use a similar example of friends’ ages, this is what the syntax would look like this:

CODE:https://gist.github.com/fronofro/184c68ce4eac509a75f4d06c3b1cfb21.js

When using NTILE() in SQL, if we have an odd number of values in each of our quartiles, the maximum value in the first quartile will be the Q1 value, and the maximum value in the third quartile will be the Q3 value.

In our case, the Q1 value is 31, and the Q2 value is 35. We can easily identify these values using a subquery.

CODE:https://gist.github.com/fronofro/aa91f2e4ebdbe7471f665d0d2ca42826.js

This gives us an IQR of 4, and 1.5 x 4 is 6.

To find the lower threshold for our outliers we subtract from our Q1 value:

31 - 6 = 25

To find the upper threshold for our outliers we add to our Q3 value:

35 + 6 = 41

We can then use WHERE to filter values that are above or below the threshold.

CODE:https://gist.github.com/fronofro/1cdff4b7a192022fc23c296edf6f3a6f.js

While the above is great for explaining what we’re doing when finding outliers, it does have some issues. It works if the number of values in each half are odd, but we would need to make adjustments if it was even. Also, it requires us to write two different queries to get our results.

It can be helpful to be able to run a single query that pulls the results. Here’s one way to do that:

CODE:https://gist.github.com/fronofro/329101016b13ed2bc3a7799e08be4065.js

This query uses ROW_NUMBER and FLOOR (always rounds down) to find the Q1 (at the row 25% of the way through) and Q3 (at the row 75% of the way through) values. From there, we can calculate the IQR x 1.5 and used these as reference for our outliers.

The above query will be 100% accurate if each half of our data has an odd number of values. To check this, we would count the total number of rows and divide this count in half. If the result end in .5, we would round down. From there, we can determine if there are an odd or even number of values in each half.

For example, if I have a total of 55 values, half of this is 27.5. If I round down, I have 27, which is odd and therefore the query will be entirely accurate.

If the value is even, for example, I had 26 values in each half instead of 27, then I would need to adjust the query to be entirely accurate. In this case I would need the average of the 13th and the 14th values to find Q1 and similarly for Q3.

To do this, the above query can be adjusted as follows:

CODE:https://gist.github.com/fronofro/59895778b9f9ae304d002c14b68a5b7a.js

©2019 All Rights Reserved.