How to Find Outliers with SQL

Last modified: July 03, 2019

Use ORDER BY

A fast way to identify outliers is to sort the relevant values in both ascending and descending order. This allows you to quickly skim through the highest and lowest values. If you have a sense of what you are expecting from your data, this can help you quickly identify any unexpected values.

If we were looking at the ages of friends, we’d expect to see values that fall into the range of ages for adults. Let’s take a look at what we see in the example below.

To sort our values, we could use the following query.

SELECT full_name,
age
FROM friends
ORDER BY age

Here’s what a sample query could look like.

Sample of query sorted by age

By sorting we can easily identify anomalies within the data. We don’t expect any of our friends to have an age of 1, so this outlier is likely an error.

Because this is a small dataset, we can also view the final lines in our results. An age of 999 is not a possible value for the age of our friends, and we can see that we have a missing value.

If we had a larger dataset but wanted to quickly view the end of our data, we could do so by adding DESC to our ORDER BY statement.

SELECT full_name,
age
FROM friends
ORDER BY age DESC

This method can be great for identifying obvious outliers that quickly stand out. However, it may miss some other details in the data. For example, if we have less knowledge of the expected values in our dataset, there may be enough outliers that they don’t stand out with just a quick look at the high and low values.

It’s often helpful to be able to use a more statistical method for identifying outliers.

Using NTILE

One statistical method of identifying outliers is through the use of the interquartile range, or IQR. When we find values that fall outside of 1.5 times the range between our first and third quartiles, we typically consider these to be outliers.

SQL has a function that allows us to easily separate our values into our four quartiles. When we use NTILES() we separate our data into the same number of groups as the value inside the brackets. Therefore, if we want to separate our data into quartiles we would use NTILE(4).

NTILE is used in conjunction with a window function. If we use a similar example of friends’ ages, this is what the syntax would look like this:

SELECT full_name,
age,
NTILE(4) OVER (ORDER BY age) AS age_quartile
FROM friends

Showing how to make a quartile using ntile(4)

When using NTILE() in SQL, if we have an odd number of values in each of our quartiles, the maximum value in the first quartile will be the Q1 value, and the maximum value in the third quartile will be the Q3 value.

In our case, the Q1 value is 31, and the Q2 value is 35. We can easily identify these values using a subquery.

SELECT age_quartile,
MAX(age) AS quartile_break
FROM(SELECT full_name,
age,
NTILE(4) OVER (ORDER BY age) AS age_quartile
FROM friends) AS quartiles
WHERE age_quartile IN (1, 3)
GROUP BY age_quartile

Results from the sql written above

This gives us an IQR of 4, and 1.5 x 4 is 6.

To find the lower threshold for our outliers we subtract from our Q1 value:

31 - 6 = 25

To find the upper threshold for our outliers we add to our Q3 value:

35 + 6 = 41

We can then use WHERE to filter values that are above or below the threshold.

SELECT full_name,
age
FROM friends
WHERE age < 25 OR age > 41

Values returned by the sql written above

Fully Query

While the above is great for explaining what we’re doing when finding outliers, it does have some issues. It works if the number of values in each half are odd, but we would need to make adjustments if it was even. Also, it requires us to write two different queries to get our results.

It can be helpful to be able to run a single query that pulls the results. Here’s one way to do that:

with orderedList AS (
SELECT full_name,
age,
ROW_NUMBER() OVER (ORDER BY age) AS row_n
FROM friends
),
iqr AS (
SELECT age, full_name,
(
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*)
FROM friends)*0.75)
) AS q_three,
(
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*)
FROM friends)*0.25)
) AS q_one,
1.5 * ((
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*)
FROM friends)*0.75)
) - (
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*)
FROM friends)*0.25)
)) AS outlier_range
FROM orderedList
)

SELECT full_name, age
FROM iqr
WHERE age >= ((SELECT MAX(q_three)
FROM iqr) +
(SELECT MAX(outlier_range)
FROM iqr)) OR
age <= ((SELECT MAX(q_one)
FROM iqr) -
(SELECT MAX(outlier_range)
FROM iqr))

This query uses ROW_NUMBER and FLOOR (always rounds down) to find the Q1 (at the row 25% of the way through) and Q3 (at the row 75% of the way through) values. From there, we can calculate the IQR x 1.5 and used these as reference for our outliers.

Technical Note

The above query will be 100% accurate if each half of our data has an odd number of values. To check this, we would count the total number of rows and divide this count in half. If the result end in .5, we would round down. From there, we can determine if there are an odd or even number of values in each half.

For example, if I have a total of 55 values, half of this is 27.5. If I round down, I have 27, which is odd and therefore the query will be entirely accurate.

If the value is even, for example, I had 26 values in each half instead of 27, then I would need to adjust the query to be entirely accurate. In this case I would need the average of the 13th and the 14th values to find Q1 and similarly for Q3.

To do this, the above query can be adjusted as follows:

with orderedList AS (
SELECT full_name,
age,
ROW_NUMBER() OVER (ORDER BY age) AS row_n
FROM friends
),

quartile_breaks AS (
SELECT age, full_name,
(
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*) FROM friends)*0.75)
) AS q_three_lower,
(
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*) FROM friends)*0.75) + 1
) AS q_three_upper,
(
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*) FROM friends)*0.25)
) AS q_one_lower,
(
SELECT age AS quartile_break
FROM orderedList
WHERE row_n = FLOOR((SELECT COUNT(*) FROM friends)*0.25) + 1
) AS q_one_upper
FROM orderedList
),

iqr AS (
SELECT age, full_name,
(
(SELECT MAX(q_three_lower) FROM quartile_breaks) +
(SELECT MAX(q_three_upper) FROM quartile_breaks)
)/2 AS q_three,
(
(SELECT MAX(q_one_lower) FROM quartile_breaks) +
(SELECT MAX(q_one_upper) FROM quartile_breaks)
)/2 AS q_one,
1.5 * ((
(SELECT MAX(q_three_lower) FROM quartile_breaks) +
(SELECT MAX(q_three_upper) FROM quartile_breaks)
)/2 - (
(SELECT MAX(q_one_lower) FROM quartile_breaks) +
(SELECT MAX(q_one_upper) FROM quartile_breaks)
)/2) AS outlier_range
FROM quartile_breaks
)

SELECT full_name, age
FROM iqr
WHERE age >= ((SELECT MAX(q_three)
FROM iqr) +
(SELECT MAX(outlier_range)
FROM iqr)) OR
age <= ((SELECT MAX(q_one)
FROM iqr) -
(SELECT MAX(outlier_range)
FROM iqr))

Written by: Rebecca Barnes
Reviewed by: Matt David

Next – How Regex in SQL Works

Get new data chapters sent right to your Inbox