Find Duplicates

Last modified: June 22, 2021 • Reading Time: 5 minutes

To find duplicate values in SQL, you must first define your criteria for duplicates and then write the query to support the search.

Our sample table, called users, shows our Facebook friends and their relevant information. This information includes first and last names, gender and the date when the friend request was accepted. Here are a few sample rows:

user_id first_name last_name gender accepted_at
1 Carlo Thomas male 2020-04-17
2 Dom Scallion male 2011-01-30
3 Tania Lopes female 2020-01-03
4 Carlo Thomas male 2001-05-06
5 Samantha Fargo female 2019-03-25

Finding Duplicate Values within a Single Field

If we wanted to understand how many of our current Facebook friends had the same first name, we could do so by executing the following SQL query:

SELECT 
	first_name, 
	COUNT(*)
FROM 
	users
GROUP BY 
	first_name

In order to see how many of these names appear more often than others, you could add an additional ORDER BY statement to the end of the query and order by DESC.

first_name count
Tania 1
Carlo 2
Dom 1
Samantha 1

Additionally, you could add an additional HAVING clause to the above query like so:

SELECT 
	first_name, 
	COUNT(*)
FROM 
	users
GROUP BY 
	first_name
HAVING 
	COUNT (first_name) > 1

With the following sample users table, we can see that the first_name Carlo appears more than once and therefore you have more than one friend named Carlo on your Facebook friends list. The HAVING clause in this query deliberately selects instances where the field in the SELECT statement appears more than once and anything that appears in the result is a duplicate value in the table.

first_name count
Carlo 2

Finding Duplicate Values Across 2 or More Fields

Similarly, if you wanted to understand how many Facebook friends have created more than one account, you could run the following query:

SELECT 
	first_name, 
	last_name, 
	COUNT(*)
FROM 
	users
GROUP BY 
	first_name, 
	last_name
HAVING 
	COUNT(*) > 1
user_id first_name last_name
1 Carlo Thomas
4 Carlo Thomas

By executing this query we see that your friend Carlo Thomas has created two Facebook accounts causing you to have duplicate accounts in your friends list. The advantage of these methods is that we get to easily determine how many users in your account show up more than once in your friends list by first and/or last name.

Row Number Approach

Another way to search for duplicate values is to use the ROW_NUMBER window function. We can use this function to number each row in the table where the parameters for the ranking are determined by the partition by. This method is most useful when there are parameters included with ranking the duplicate records. For more information on how window functions work see here.

WITH
facebook_friends_deduped AS (
	SELECT
		first_name,
		last_name,
		ROW_NUMBER() OVER (PARTITION BY 
        	first_name, last_name ORDER BY accepted_at ASC)
 			AS occurrence
	FROM
		users
	)
SELECT
	*
FROM
	facebook_friends_deduped
WHERE
	rank > 1

The output of this query would look something like this:

user_id first_name last_name accepted_at occurrence
1 Carlo Thomas 2020-04-17 2

The above query determined that Carlo Thomas was your duplicate Facebook friend. The ROW_NUMBER function determines which Carlo Thomas account was accepted first (the earliest date value). The first instance of Carlo Thomas in our Facebook friends list is in 2001 and therefore when we added him again in 2020, he became a duplicate record in our list of Facebook friends.

Using CASE statements

Utilizing the Row Number technique from above, we can additionally use a CASE statement to insert a new boolean field into the table that describes whether the row is duplicated or not. See how CASE WHEN works here.

SELECT
	*,
	CASE 
		WHEN ROW NUMBER() OVER (PARTITION BY 
 			first_name, last_name > 1 THEN TRUE 
			ELSE FALSE
	END AS is_duplicated
FROM
	users

This method would return the following result:

user_id first_name last_name gender accepted_at is_duplicated
1 Carlo Thomas male 2020-04-17 TRUE
2 Dom Scallion male 2011-01-30 FALSE
3 Tania Lopes female 2020-01-03 FALSE
4 Carlo Thomas male 2001-05-06 TRUE
5 Samantha Fargo female 2019-03-25 FALSE

Here, we can see that Carlo Thomas has a TRUE value in the is_duplicated field created by the CASE statement in the query above. The drawback of using this method is that we can not easily identify which user record was created in the database first.

Overall, all of these methods are important and can be used to locate duplicate records. Depending on the size of your table, the level of granularity you need to provide and what the business decision is after you’ve located the duplicate records, each of these techniques may be used over another in different contexts. Similarly, if your work also includes deleting duplicate rows there are additional CTEs (common table expressions) that can execute that body of work.

Written by: Katherine Chiodo
Reviewed by: Matt David

Next – PostgreSQL Generate_Series

Get new data chapters sent right to your Inbox