GROUP BY
Last modified: December 09, 2019
So far our aggregation functions have run across all of the data, but it’s often useful to split the aggregation into groups.
Let’s say for example that we wanted to get not a count of all of the tracks, but how many tracks were in each genre. One way of doing this would be to write a separate query for each genre like this:
SELECT COUNT(*) FROM tracks WHERE genre_id = 1;
SELECT COUNT(*) FROM tracks WHERE genre_id = 2;
SELECT COUNT(*) FROM tracks WHERE genre_id = 3;
.
.
.
SELECT COUNT(*) FROM tracks WHERE genre_id = n;
But we’d have to know what all the genre_id’s were and use some other tool to combine all of the results back together. Not ideal.
Luckily, we have the GROUP BY clause which makes this a whole lot simpler. The GROUP BY clause tells the database how to group a result set, so we can more simply write the queries above as:
How cool is that?! Can you get a count of all tracks by composer?
It’s useful here to order the results of this query by the count, so we can see which composers have produced the largest number of tracks (at least in our database).
Above, the NULL composer is being counted as having the most tracks. That’s just noise. Using what we just learned abour NULL operators, can you modify the query to filter out the NULL composers?
Multiple GROUP BYs
You can group by more than one thing, and it simply creates a second set of groups inside the first set. Try running the following example which groups first by genre and then by composer.
The priority/order of the groups is the same as how you list them. You can see that switching the order of genre_id and composer in the GROUP BY clause makes quite a different query:
Notice that I also added ORDER BY clauses to make the output a little more clear. ORDER BY’s are quite useful and common when using GROUP BY.
GROUP BY Rules
There are a few rules to follow when using GROUP BYs. The largest is that all data that isn’t listed as a parameter to GROUP BY needs an aggregation function applied to it. Think of what the following query:
SELECT genre_id, unit_price FROM tracks GROUP BY genre_id;
It throws an error because the database doesn’t know what to do about unit_price. While there is only one genre_id per group, there are many unit_prices. They all can’t just be output as a value without some aggregation function.
Can you correct the above query to get the average unit_price by genre_id?
GROUP BY Errors
It’s easy to forget this rule and if so you’re going to see an error like the following
ERROR: column "tracks.composer" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT genre_id, composer FROM tracks GROUP BY genre_id;
Just remember that that means you have to either add that column to the GROUP BY or apply an aggregation function to it so the database knows what to do.
The following example will throw this error because the database doesn’t know what to do with all of the unit prices. Can you modify it to do return the average unit_price by genre_id?
Written by:
Dave Fowler
Reviewed by:
Matt David