Finding duplicate records in a MySQL table

So, you imported a large CSV file. You do not have a UNIQUE index on one of the fields, for example, the “student_number” field. Your client asks you to find a list of the duplicates. What do you do?

You write an SQL query, of course!

Assuming we want to de-duplicate the entire student database from the “students” table, by finding duplicates of the field “student_number”. Here is how you go about it:

First you write the INNER query:

SELECT `student_number` FROM `students` GROUP BY `student_number` HAVING COUNT(`id`) > 1

You now assign the alias “duplicate” to this query:

(SELECT `student_number` FROM `students` GROUP BY `student_number` HAVING COUNT(`id`) > 1) duplicate

Next, you write the OUTER query, and specify the match fields:

SELECT `first_name`, `last_name`, students.student_number FROM `students`
INNER JOIN __________ ON students.student_number = duplicate.student_number

Finally, you drop in the INNER query into the _____ in the OUTER query, leaving you with:

SELECT `first_name`, `last_name`, students.student_number FROM `students`
INNER JOIN (SELECT `student_number` FROM `students` GROUP BY `student_number` HAVING COUNT(`id`) > 1) duplicate ON students.student_number = duplicate.student_number

The OUTER query contain the following elements:

* The fields that you want to show when displaying the records (first_name, last_name)
* The field that you want to de-duplicate (student_number)

The INNER query does the actual magic by joining the same table with itself on the student_number field, finding all student_numbers that have more than one ID.

There you have it. Simple, right?

Leave a Reply

Your email address will not be published.