So, you imported a large CSV file. You do not have a UNIQUE index on one of the fields, for example, the “student_number” field. Your client asks you to find a list of the duplicates. What do you do?
You write an SQL query, of course!
Assuming we want to de-duplicate the entire student database from the “students” table, by finding duplicates of the field “student_number”. Here is how you go about it:
First you write the INNER query:
SELECT `student_number` FROM `students` GROUP BY `student_number` HAVING COUNT(`id`) > 1
You now assign the alias “duplicate” to this query:
(SELECT `student_number` FROM `students` GROUP BY `student_number` HAVING COUNT(`id`) > 1) duplicate
Next, you write the OUTER query, and specify the match fields:
SELECT `first_name`, `last_name`, students.student_number FROM `students`
INNER JOIN __________ ON students.student_number = duplicate.student_number
Finally, you drop in the INNER query into the _____ in the OUTER query, leaving you with:
SELECT `first_name`, `last_name`, students.student_number FROM `students`
INNER JOINÂ (SELECT `student_number` FROM `students` GROUP BY `student_number` HAVING COUNT(`id`) > 1) duplicate ON students.student_number = duplicate.student_number
The OUTER query contain the following elements:
* The fields that you want to show when displaying the records (first_name, last_name)
* The field that you want to de-duplicate (student_number)
The INNER query does the actual magic by joining the same table with itself on the student_number field, finding all student_numbers that have more than one ID.
There you have it. Simple, right?