The Problem
You're walking in the woods of your database
There's no one around, and your laptop battery is (almost) dead
Out of the corner of your eye, you spot them - duplicates
It starts bugging you, "How did they get there?"
You select all the data and scratch your head
But you start to see it's not just a few rows
There are duplicates everywhere!
(Heavily inspired by Rob Cantor's Shia LaBeouf )
You identify 1 or 2 columns where you can sort out those nasty repeated rows of information.
You only want to keep one of them, a NOT NULL
& not blank value.
"Use a PARTITION_BY
!", I hear someone shouting from the imaginary crowd for this horror tale (it was actually Joana).
Except....You're using MySQL. *sigh*
The ONLY_FULL_GROUP_BY flag (side note)
If you're having issues with your current query for a similar problem, know that this little flag is enabled by default since MySQL 5.7.
It won't allow you to use columns in the SELECT
statement that are not included in the GROUP_BY
statement, except if those are functions (like COUNT
, SUM
, AVG
, MAX
, MIN
, etc).
You could disable it and carry on with your life...
mysql > SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));
But then again, why are you reading this?
The Attempts
In case you don't notice it during the following section, I'll be working with translations and I have repeated pairs of locale
& translated_object_id
- a foreign_key. I only need to keep 1 valid translation.
Any_Value
SELECT ANY_VALUE(id) id, translated_object_id, locale
FROM translated_objects
GROUP BY translated_object_id, locale
HAVING COUNT(*) > 1;
My first attempt was invalid.
I searched about the ONLY_FULL_GROUP_BY flag as soon as I ran my query when I initially thought I could just get the ID field I needed as long as I grouped and counted the resulting rows.
As a good programmer, I turned to StackOverflow and promptly realised my mistake, so I added the ANY_VALUE
function because I was "sure that all the results inside the group are the same".
Yeah right...
Self-Not-Null-Left-Join
SELECT t0.id
FROM translated_objects as t0
LEFT JOIN translated_objects AS t1 ON t0.translated_object_id = t1.translated_object_id AND t0.locale = t1.locale AND t1.id > t0.id
WHERE t1.translated_object_id IS NOT NULL;
The next chapter was basically a re-iteration of the previous solution, less fancier but it was still getting the job done.
I was simply looking for an alternative to the GROUP_BY
query after my first failure, and when I saw that my previous solution did not provide all the results I needed, I had to give this query a shot.
Still, after digging further and finally understanding my problem, I knew this path would never lead me to shore, since I had 4 or 5 duplicated lines per GROUP_BY
and these queries only returned 1 of them.
I knew what I needed & I couldn't avoid it any longer. (Joana was right...)
Partition By w/ 1 column (explanation)
SELECT *
FROM translated_objects
WHERE id IN (
SELECT id
FROM
(
SELECT @row_number := IF(@previous_value=o.translated_object_id,@row_number+1,1) AS RowNumber
,o.id
,o.translated_object_id
,o.locale
,@previous_value := o.translated_object_id
FROM translated_objects o,
(SELECT @row_number := 1) x,
(SELECT @previous_value := '') y
ORDER BY o.translated_object_id, o.updated_at DESC
) subquery
WHERE RowNumber > 1
);
If you checked the previous queries, you will notice this wasn't actually a solution to my problem since I need to take into consideration 2 columns. For completeness sake, and for posterity, I'll leave a partition query with only 1 column.
MySQL does not have a ROW_NUMBER
function like other database softwares do, which allows to differentiate between grouped results by conveniently assigning them - you guessed it - a row_number
. However, it does have the ability to use variables relatively freely, so I only need to create my own.
Furthermore, I also need to keep track of the partition so I know when to reset the row_number
. That's why I use the previous_value
, carefully checking when my foreign key changes with the IF
statement you see on the subquery.
Enough chit-chat. Let's move on!
Partition By w/ 2 columns (or more...)
SELECT *
FROM translated_objects
WHERE id IN (
SELECT id
FROM
(
SELECT @row_number := IF(@previous_value=concat_ws('',o.translated_object_id, o.locale),@row_number+1,1) AS RowNumber
,o.id
,o.translated_object_id
,o.locale
,@previous_value := concat_ws('',o.translated_object_id, o.locale)
FROM translated_objects o,
(SELECT @row_number := 1) x,
(SELECT @previous_value := '') y
ORDER BY o.translated_object_id, o.locale, (CASE WHEN o.name IS NULL THEN 0 ELSE 1 END) DESC, o.updated_at ASC
) subquery
WHERE RowNumber > 1
);
Sweet Christmas! Finally got it!
Like stated before, I needed to PARTITION BY
2 columns - my foreign_key translated_object_id
and locale
- so I have to CONCAT
these values.
I'm using CONCAT_WS
here since unlike your regular brand of concatenation, this method will actually skip any NULL values and return the resulting (expected) string. This is now my main control value, so it updates my row_number
and previous_value
accordingly.
Finally, since I want to ORDER BY
the latest NOT NULL
value, I added an explicit CASE
to the statement.
Query Optimization
SELECT *
FROM translated_objects
WHERE id IN (
SELECT id
FROM (
SELECT @row_number := IF(@previous_value = CONCAT_WS('', o.translated_object_id, o.locale), @row_number + 1 ,1) AS RowNumber
,o.id
,@previous_value := CONCAT_WS('', o.translated_object_id, o.locale)
FROM (
SELECT translated_objects.id, translated_objects.translated_object_id, translated_objects.locale, translated_objects.name, translated_objects.updated_at
FROM translated_objects, (
SELECT translated_object_id, locale, COUNT(*) AS count
FROM translated_objects
GROUP BY translated_object_id, locale
HAVING COUNT(*) > 1
) AS repeated
WHERE translated_objects.translated_object_id = repeated.translated_object_id
AND translated_objects.locale = repeated.locale
) o,
(SELECT @row_number := 1) x,
(SELECT @previous_value := '') y
ORDER BY o.translated_object_id, o.locale, (CASE WHEN o.name IS NULL OR translated_objects.name = '' THEN 0 ELSE 1 END) DESC, o.updated_at ASC
) subquery
WHERE RowNumber > 1
);
You might not notice it, but the CASE
in the ORDER BY
line was updated. If, like me, you had a lot of empty non-null lines in your rows, you are going to need to filter those out too if you really want to obtain the last valid result.
However, "empty" might be a valid result for you, so I decided to only include it after for the article's purpose.
Now, when working with large tables, I was still waiting ~5 minutes before I could get any result. I was accessing the whole table with FROM translated_objects o
, which probably loads the dataset to memory to work on it when the partitions extend for a large number of rows.
Therefore, to reduce the amount of data needed, search no further then the very first attempt in the article, with a small twist (thanks Filipe!).
SELECT translated_object_id, locale, COUNT(*) AS count
FROM translated_objects
GROUP BY translated_object_id,
HAVING COUNT(*) > 1
Instead of fetching ANY_VALUE
from the table, simply count the number of results from the GROUP BY
and return the parsed set of data. Furthermore, simply select the columns you absolutely need in order to work with the minimum amount of data required.
Indexes Optimization
CREATE INDEX idx ON translated_objects (translated_object_id, locale);
In order to accelerate the GROUP BY
subsubsubsubquery, add an index on both columns from GROUP BY translated_object_id, locale
to make it even faster.
Final note
CREATE UNIQUE INDEX idx ON translated_objects (translated_object_id, locale);
Add an unique index on both columns from GROUP BY translated_object_id, locale
after to avoid any future errors.
You're finally safe...from those nasty duplicates.
Sources
https://preilly.me/2011/11/11/mysql-row_number/
https://stackoverflow.com/questions/1895110/row-number-in-mysql/8177307#8177307