The Problem

You're walking in the woods of your database
There's no one around, and your laptop battery is (almost) dead
Out of the corner of your eye, you spot them - duplicates
It starts bugging you, "How did they get there?"
You select all the data and scratch your head
But you start to see it's not just a few rows
There are duplicates everywhere!

(Heavily inspired by Rob Cantor's Shia LaBeouf )

You identify 1 or 2 columns where you can sort out those nasty repeated rows of information.

You only want to keep one of them, a NOT NULL & not blank value.

"Use a PARTITION_BY!", I hear someone shouting from the imaginary crowd for this horror tale (it was actually Joana).

Except....You're using MySQL. *sigh*

The ONLY_FULL_GROUP_BY flag (side note)

If you're having issues with your current query for a similar problem, know that this little flag is enabled by default since MySQL 5.7.

It won't allow you to use columns in the SELECT statement that are not included in the GROUP_BY statement, except if those are functions (like COUNT, SUM, AVG, MAX, MIN, etc).

You could disable it and carry on with your life...

mysql > SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));

But then again, why are you reading this?

The Attempts

In case you don't notice it during the following section, I'll be working with translations and I have repeated pairs of locale & translated_object_id - a foreign_key. I only need to keep 1 valid translation.

Any_Value

SELECT ANY_VALUE(id) id, translated_object_id, locale
FROM translated_objects
GROUP BY translated_object_id, locale
HAVING COUNT(*) > 1;

My first attempt was invalid.

I searched about the ONLY_FULL_GROUP_BY flag as soon as I ran my query when I initially thought I could just get the ID field I needed as long as I grouped and counted the resulting rows.

As a good programmer, I turned to StackOverflow and promptly realised my mistake, so I added the ANY_VALUE function because I was "sure that all the results inside the group are the same".

Yeah right...

Self-Not-Null-Left-Join

SELECT t0.id
FROM translated_objects as t0
LEFT JOIN translated_objects AS t1 ON t0.translated_object_id = t1.translated_object_id AND t0.locale = t1.locale AND t1.id > t0.id
WHERE t1.translated_object_id IS NOT NULL;

The next chapter was basically a re-iteration of the previous solution, less fancier but it was still getting the job done.

I was simply looking for an alternative to the GROUP_BY query after my first failure, and when I saw that my previous solution did not provide all the results I needed, I had to give this query a shot.

Still, after digging further and finally understanding my problem, I knew this path would never lead me to shore, since I had 4 or 5 duplicated lines per GROUP_BY and these queries only returned 1 of them.

I knew what I needed & I couldn't avoid it any longer. (Joana was right...)

Partition By w/ 1 column (explanation)

SELECT *  
FROM translated_objects  
WHERE id IN (
  SELECT id
  FROM
    (
      SELECT  @row_number := IF(@previous_value=o.translated_object_id,@row_number+1,1) AS RowNumber
             ,o.id
             ,o.translated_object_id 
             ,o.locale
             ,@previous_value := o.translated_object_id
        FROM translated_objects o,
             (SELECT @row_number := 1) x,
             (SELECT @previous_value := '') y
       ORDER BY o.translated_object_id, o.updated_at DESC
    ) subquery
  WHERE RowNumber > 1
);

If you checked the previous queries, you will notice this wasn't actually a solution to my problem since I need to take into consideration 2 columns. For completeness sake, and for posterity, I'll leave a partition query with only 1 column.

MySQL does not have a ROW_NUMBER function like other database softwares do, which allows to differentiate between grouped results by conveniently assigning them - you guessed it - a row_number. However, it does have the ability to use variables relatively freely, so I only need to create my own.

Furthermore, I also need to keep track of the partition so I know when to reset the row_number. That's why I use the previous_value, carefully checking when my foreign key changes with the IF statement you see on the subquery.

Enough chit-chat. Let's move on!

Partition By w/ 2 columns (or more...)

SELECT *
FROM translated_objects
WHERE id IN (
  SELECT  id
  FROM
     (
      SELECT  @row_number := IF(@previous_value=concat_ws('',o.translated_object_id, o.locale),@row_number+1,1) AS RowNumber
             ,o.id
             ,o.translated_object_id 
             ,o.locale
             ,@previous_value := concat_ws('',o.translated_object_id, o.locale)
        FROM translated_objects o,
             (SELECT @row_number := 1) x,
             (SELECT @previous_value := '') y
       ORDER BY o.translated_object_id, o.locale, (CASE WHEN o.name IS NULL THEN 0 ELSE 1 END) DESC, o.updated_at ASC
     ) subquery
 WHERE RowNumber > 1
 );

Sweet Christmas! Finally got it!

Like stated before, I needed to PARTITION BY 2 columns - my foreign_key translated_object_id and locale - so I have to CONCAT these values.

I'm using CONCAT_WS here since unlike your regular brand of concatenation, this method will actually skip any NULL values and return the resulting (expected) string. This is now my main control value, so it updates my row_number and previous_value accordingly.

Finally, since I want to ORDER BY the latest NOT NULL value, I added an explicit CASE to the statement.

Query Optimization
SELECT *
FROM translated_objects
WHERE id IN (
  SELECT id
  FROM (
    SELECT  @row_number := IF(@previous_value = CONCAT_WS('', o.translated_object_id, o.locale), @row_number + 1 ,1) AS RowNumber
           ,o.id
           ,@previous_value := CONCAT_WS('', o.translated_object_id, o.locale)
    FROM (
          SELECT translated_objects.id, translated_objects.translated_object_id, translated_objects.locale, translated_objects.name, translated_objects.updated_at
          FROM translated_objects, (
            SELECT translated_object_id, locale, COUNT(*) AS count
            FROM translated_objects
            GROUP BY translated_object_id, locale
            HAVING COUNT(*) > 1
          ) AS repeated
          WHERE translated_objects.translated_object_id = repeated.translated_object_id
          AND translated_objects.locale = repeated.locale
    ) o,
           (SELECT @row_number := 1) x,
           (SELECT @previous_value := '') y
    ORDER BY o.translated_object_id, o.locale, (CASE WHEN o.name IS NULL OR translated_objects.name = '' THEN 0 ELSE 1 END) DESC, o.updated_at ASC
    ) subquery
    WHERE RowNumber > 1
);

You might not notice it, but the CASE in the ORDER BY line was updated. If, like me, you had a lot of empty non-null lines in your rows, you are going to need to filter those out too if you really want to obtain the last valid result.

However, "empty" might be a valid result for you, so I decided to only include it after for the article's purpose.

Now, when working with large tables, I was still waiting ~5 minutes before I could get any result. I was accessing the whole table with FROM translated_objects o, which probably loads the dataset to memory to work on it when the partitions extend for a large number of rows.

Therefore, to reduce the amount of data needed, search no further then the very first attempt in the article, with a small twist (thanks Filipe!).

SELECT translated_object_id, locale, COUNT(*) AS count
FROM translated_objects
GROUP BY translated_object_id, 
HAVING COUNT(*) > 1

Instead of fetching ANY_VALUE from the table, simply count the number of results from the GROUP BY and return the parsed set of data. Furthermore, simply select the columns you absolutely need in order to work with the minimum amount of data required.

Indexes Optimization
CREATE INDEX idx ON translated_objects (translated_object_id, locale);

In order to accelerate the GROUP BY subsubsubsubquery, add an index on both columns from GROUP BY translated_object_id, locale to make it even faster.

Final note

CREATE UNIQUE INDEX idx ON translated_objects (translated_object_id, locale);

Add an unique index on both columns from GROUP BY translated_object_id, locale after to avoid any future errors.

You're finally safe...from those nasty duplicates.


Sources

https://stackoverflow.com/questions/34115174/error-related-to-only-full-group-by-when-executing-a-query-in-mysql

https://preilly.me/2011/11/11/mysql-row_number/

https://stackoverflow.com/questions/1895110/row-number-in-mysql/8177307#8177307

https://stackoverflow.com/a/35006359/4341031