addons Some DB tables are using different collation sets

I found this while playing around with the blocklist tables on Redash. Certain queries require a collate statement to work and can be very slow for that reason (about 3 mins for fairly simple queries).

Quoth @diox:

we should be consistent, settle on utf8mb4_unicode_ci for now on everything

┆Issue is synchronized with this Jira Task

Jul 07 '20 21:07 jvillalobos

~We should have set table collation to utf8mb4_unicode_ci per https://github.com/mozilla/addons/issues/6942.~

~Do we know why many tables reverted back to utf8mb4_general_ci?~

Currently these tables are utf8mb4_unicode_ci.

> SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES where TABLE_SCHEMA = 'addons_mozilla_org' and TABLE_COLLATION = 'utf8mb4_unicode_ci';
+---------------+--------------------+-------------------------------+--------------------+
| TABLE_CATALOG | TABLE_SCHEMA       | TABLE_NAME                    | TABLE_COLLATION    |
+---------------+--------------------+-------------------------------+--------------------+
| def           | addons_mozilla_org | blocklist_block               | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | blocklist_block_submission    | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | blocklist_blocklistsubmission | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | blocklist_kintoimport         | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | django_migrations             | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | git_gitextractionentry        | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | hero_primaryheroimage         | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | log_activity_block            | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | scanners_query_results        | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | scanners_query_rules          | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | scanners_rules                | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | scanners_scannermatch         | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | scanners_scannerquerymatch    | utf8mb4_unicode_ci |
| def           | addons_mozilla_org | versions_versionreviewerflags | utf8mb4_unicode_ci |
+---------------+--------------------+-------------------------------+--------------------+
14 rows in set (0.001 sec)

I verified this is the same across -dev, -stage and -prod.

Jul 08 '20 01:07 bqbn

Oh, I see, in https://github.com/mozilla/addons/issues/6942 it was meant to only change the database collation, not tables.

Jul 08 '20 01:07 bqbn

https://bugzilla.mozilla.org/show_bug.cgi?id=1479111 mentions both collations and it looks like we had a problem switching existing tables to utf8mb4_unicode_ci:

mysql> ALTER TABLE addons CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, LOCK=SHARED; ERROR 1062 (23000): Duplicate entry 'ｆrame-is-transparent' for key 'slug' mysql> ALTER TABLE addons CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci, LOCK=SHARED; Query OK, 842247 rows affected (1 min 14.49 sec) Records: 842247 Duplicates: 0 Warnings: 0

I had forgotten about this... I thought we had "fixed" the data but apparently not.

Jul 08 '20 09:07 diox

A summary of the mess we're in:

Most of the old tables, addons in particular (also users because of username, but that column is less important nowadays and we could work around it), are using utf8mb4_general_ci. It's the default legacy collation in MySQL, it's a fairly "dumb" collation.
The new tables are using utf8mb4_unicode_ci.
Under utf8mb4_unicode_ci however, some slugs (which have to be unique) that are considered different with utf8mb4_general_ci are considered equal - because that collation supports more replacements
It's worth pointing out that MySQL 8.0 (which we aren't using) supports even more collations and has a different default, utf8mb4_0900_ai_ci which supports more recent unicode versions. Presumably that'd mean even more slugs would be considered duplicate with that collation.
We haven't yet figured out a good way to identify which slugs are problematic and what to do with them - and that's why the old tables have stayed on the old collation. On stage, we have roughly 2.6k slugs with non ascii characters (SELECT COUNT(*) FROM addons WHERE slug NOT REGEXP '^[A-Za-z0-9\.,@& \-_]*$';), on prod 13k.

Sep 08 '20 11:09 diox

We now have a query that can identify problematic slugs: SELECT slug, count(*) AS cnt FROM addons WHERE slug IS NOT NULL GROUP BY slug collate utf8mb4_0900_ai_ci HAVING cnt > 1;

(this is the version using utf8mb4_0900_ai_ci on MySQL 8.0, which we are not using yet)

Using this query we found 13 problematic duplicates that we need to handle to move forward.

It can also be used for usernames, though we can leave it aside for now (there are lot more problematic usernames).

Sep 18 '20 10:09 diox

This issue has been automatically marked as stale because it has not had recent activity. If you think this bug should stay open, please comment on the issue with further details. Thank you for your contributions.

Mar 19 '21 23:03 stale[bot]

Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-42

May 03 '24 16:05 KevinMind

We eventually switched to MySQL 8.0 (in https://github.com/mozilla/addons/issues/8043 and https://bugzilla.mozilla.org/show_bug.cgi?id=1666431) and that means that we should revisit default collations again.

https://github.com/mozilla/addons/issues/15032 is an example of something that can fail or not depending on the collation used, so it's important to harmonize across our environments. Ideally we should be using utf8mb4_0900_ai_ci everywhere unless there is a good reason not to on specific tables/columns.

Oct 01 '24 16:10 diox

Also note that new local/CI environments are already different from dev/stage/prod, since they should already be using utf8mb4_0900_ai_ci, where as dev/stage/prod are still on a mix of utf8mb4_unicode_ci/utf8mb4_general_ci.

Oct 01 '24 16:10 diox

addons addons copied to clipboard

Some DB tables are using different collation sets

addons
addons copied to clipboard