The interquartile range is a widely accepted method to find outliers in data. When using the interquartile range, or IQR, the full dataset is split into four equal segments, or quartiles. The distances between the quartiles is what is used to determine the IQR.
Here’s how it works. Let’s say that we had a pretty diverse group of 15 friends with the following ages: 31, 21, 26, 30, 31, 45, 47, 32, 53, 54, 55, 38, 43, 57, 64. If we wanted to find the IQR, we would do the following:
- Order the ages from smallest to largest
- Find the middle value and create a group above and below this
- Find the middle value for each group that was created
- Find the difference between the middle of the top and bottom groups
Let’s work through that. First we start off will all of our ages unordered.
- Once we’ve ordered them from smallest to largest, they’ll look like this. You can also see their position below each number.
- If we have 15 ages, the middle age will be at the 8th position. As we can see, the age 43 is in the 8th position.
We can use the median to split our two groups. All ages between 21 and 38 are in the bottom group, and all of our ages between 45 and 64 are in the top group.
- Each group now has seven ages in it. So the middle value for each group will be in the 4th position.
- We calculate the interquartile range by first finding the value in the middle of the top group, which is 54 in this case. We then find the middle value in the bottom group, which is 31 in our example. The IQR is the difference between these two values. That is, 54 - 31, or, 23.
This method of breaking the groups in half, finding the middle number and repeating this for each half works perfectly with a collection of 15 ages. But we can still work out the interquartile range if we had an even number of ages and couldn’t find middle values. Let’s say we had these 12 ages, instead of our original 15.
With it ordered, it would look like this.
As we can see, because the total number is even, there isn’t a number that falls in the middle of the groups. And, if we split the groups in half, there also isn’t a number that falls in the middle of either half.
Instead of finding the middle number, we can break the ages in half, and then in half again.
In this case, the “middle” value, between each of the groups, is the average of the values on either side of the line:
(30 + 31)/2 = 31.5 (38 + 45)/2 = 41.5 (53 + 55)/2 = 54
This means that the interquartile range would be 54 - 31.5, or 22.5.
We can also refer to these values in the following way. For the above example:
The Q1 value is 31.5. This is also the 25th percentile.
That is, 25% of values are equal to or lower than 31.5.
The Q2 value is 41.5. This is also the 50th percentile or median.
That is, 50% of values are equal to or lower than 41.5.
The Q3 value is 54. This is also the 75th percentile.
That is, 75% of values are equal to or lower than 54.
The interquartile range, or IQR, is 22.5
Finding Outliers with the IQR
Minor Outliers (IQR x 1.5)
Now that we know how to find the interquartile range, we can use it to define our outliers. The most common method of finding outliers with the IQR is to define outliers as values that fall outside of 1.5 x IQR below Q1 or 1.5 x IQR above Q3.
Let’s break that down using our original example.
Our IQR was 23. If we multiply this by 1.5, we get 34.5. This means that we would consider any ages that are below -3.5 or above 88.5 to be outliers.
Notice that the thresholds for the outliers are simply defined by the data we use. Even though it’s not possible to have a negative age, our outlier calculation only considers the numerical values. In our case, because we are using ages, this means that no matter how young our friend may be, we would not consider them an outlier. We’d also need a friend who was 89 years or older to consider them an outlier.
In the case above, we have a pretty broad range of ages for our friends. What would happen if the range of ages for our friends was much smaller?
In this case, our Q1 value is 31 and our Q3 value is 35. This means that our IQR is only 4. Now, 1.5 times IQR is 6. Any values below 25, or higher than 41 will be considered outliers.
Now, our friends with the ages 21, 57, and 64 are considered outliers.
Major Outliers (IQR x 3)
This brings us to a second, less common threshold for assessing outliers. If we have a very small IQR, not all outliers are created equal. In the case above, while 21 and 64 are both outliers, 21 is only 10 years lower than our Q1 value of 31. But 64 is 30 years older than our Q3 value. How do we distinguish between “regular” outliers and “extreme” outliers.
A major outlier is defined as values that fall outside of 3 times IQR below Q1 or 3 times IQR above Q3.
If we go back to the previous example, 4 x 3 = 12. Major outliers will be those that are less than 19 and more 47.
This allows us to indicate some difference between 21 and the other two outliers. An age of 21 is not considered a major outlier, but 57 and 64 are major outliers.
Now that we know how to calculate our IQR and identify outliers, let’s look at how we can use SQL to find outliers using the IQR.