Find Duplicates

Last modified: January 14, 2021 • Reading Time: 5 minutes

To find duplicate values in SQL, you must first define your criteria for duplicates and then write the query to support the search.

Our sample table, called users, shows our Facebook friends and their relevant information. This information includes first and last names, gender and the date when the friend request was accepted. Here are a few sample rows:

user_id	first_name	last_name	gender	accepted_at
1	Carlo	Thomas	male	2020-04-17
2	Dom	Scallion	male	2011-01-30
3	Tania	Lopes	female	2020-01-03
4	Carlo	Thomas	male	2001-05-06
5	Samantha	Fargo	female	2019-03-25

Finding Duplicate Values within a Single Field

If we wanted to understand how many of our current Facebook friends had the same first name, we could do so by executing the following SQL query:

SELECT 
	first_name, 
	COUNT(*)
FROM 
	users
GROUP BY 
	first_name

In order to see how many of these names appear more often than others, you could add an additional ORDER BY statement to the end of the query and order by DESC.

first_name	count
Tania	1
Carlo	2
Dom	1
Samantha	1

Additionally, you could add an additional HAVING clause to the above query like so:

SELECT 
	first_name, 
	COUNT(*)
FROM 
	users
GROUP BY 
	first_name
HAVING 
	COUNT (first_name) > 1

With the following sample users table, we can see that the first_name Carlo appears more than once and therefore you have more than one friend named Carlo on your Facebook friends list. The HAVING clause in this query deliberately selects instances where the field in the SELECT statement appears more than once and anything that appears in the result is a duplicate value in the table.

first_name	count
Carlo	2

Finding Duplicate Values Across 2 or More Fields

Similarly, if you wanted to understand how many Facebook friends have created more than one account, you could run the following query:

SELECT 
	first_name, 
	last_name, 
	COUNT(*)
FROM 
	users
GROUP BY 
	first_name, 
	last_name
HAVING 
	COUNT(*) > 1

user_id	first_name	last_name
1	Carlo	Thomas
4	Carlo	Thomas

By executing this query we see that your friend Carlo Thomas has created two Facebook accounts causing you to have duplicate accounts in your friends list. The advantage of these methods is that we get to easily determine how many users in your account show up more than once in your friends list by first and/or last name.

Row Number Approach

Another way to search for duplicate values is to use the ROW_NUMBER window function. We can use this function to number each row in the table where the parameters for the ranking are determined by the partition by. This method is most useful when there are parameters included with ranking the duplicate records. For more information on how window functions work see here.

WITH
facebook_friends_deduped AS (
	SELECT
		first_name,
		last_name,
		ROW_NUMBER() OVER (PARTITION BY 
        	first_name, last_name ORDER BY accepted_at ASC)
 			AS occurrence
	FROM
		users
	)
SELECT
	*
FROM
	facebook_friends_deduped
WHERE
	rank > 1

The output of this query would look something like this:

user_id	first_name	last_name	accepted_at	occurrence
1	Carlo	Thomas	2020-04-17	2

The above query determined that Carlo Thomas was your duplicate Facebook friend. The ROW_NUMBER function determines which Carlo Thomas account was accepted first (the earliest date value). The first instance of Carlo Thomas in our Facebook friends list is in 2001 and therefore when we added him again in 2020, he became a duplicate record in our list of Facebook friends.

Using CASE statements

Utilizing the Row Number technique from above, we can additionally use a CASE statement to insert a new boolean field into the table that describes whether the row is duplicated or not. See how CASE WHEN works here.

SELECT
	*,
	CASE 
		WHEN ROW NUMBER() OVER (PARTITION BY 
 			first_name, last_name > 1 THEN TRUE 
			ELSE FALSE
	END AS is_duplicated
FROM
	users

This method would return the following result:

user_id	first_name	last_name	gender	accepted_at	is_duplicated
1	Carlo	Thomas	male	2020-04-17	TRUE
2	Dom	Scallion	male	2011-01-30	FALSE
3	Tania	Lopes	female	2020-01-03	FALSE
4	Carlo	Thomas	male	2001-05-06	TRUE
5	Samantha	Fargo	female	2019-03-25	FALSE

Here, we can see that Carlo Thomas has a TRUE value in the is_duplicated field created by the CASE statement in the query above. The drawback of using this method is that we can not easily identify which user record was created in the database first.

Overall, all of these methods are important and can be used to locate duplicate records. Depending on the size of your table, the level of granularity you need to provide and what the business decision is after you’ve located the duplicate records, each of these techniques may be used over another in different contexts. Similarly, if your work also includes deleting duplicate rows there are additional CTEs (common table expressions) that can execute that body of work.

Written by: Katherine Chiodo
Reviewed by: Matt David

Created by