addons
addons copied to clipboard
Some DB tables are using different collation sets
I found this while playing around with the blocklist tables on Redash. Certain queries require a collate statement to work and can be very slow for that reason (about 3 mins for fairly simple queries).
Quoth @diox:
we should be consistent, settle on
utf8mb4_unicode_cifor now on everything
┆Issue is synchronized with this Jira Task
~We should have set table collation to utf8mb4_unicode_ci per https://github.com/mozilla/addons/issues/6942.~
~Do we know why many tables reverted back to utf8mb4_general_ci?~
Currently these tables are utf8mb4_unicode_ci.
> SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES where TABLE_SCHEMA = 'addons_mozilla_org' and TABLE_COLLATION = 'utf8mb4_unicode_ci';
+---------------+--------------------+-------------------------------+--------------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION |
+---------------+--------------------+-------------------------------+--------------------+
| def | addons_mozilla_org | blocklist_block | utf8mb4_unicode_ci |
| def | addons_mozilla_org | blocklist_block_submission | utf8mb4_unicode_ci |
| def | addons_mozilla_org | blocklist_blocklistsubmission | utf8mb4_unicode_ci |
| def | addons_mozilla_org | blocklist_kintoimport | utf8mb4_unicode_ci |
| def | addons_mozilla_org | django_migrations | utf8mb4_unicode_ci |
| def | addons_mozilla_org | git_gitextractionentry | utf8mb4_unicode_ci |
| def | addons_mozilla_org | hero_primaryheroimage | utf8mb4_unicode_ci |
| def | addons_mozilla_org | log_activity_block | utf8mb4_unicode_ci |
| def | addons_mozilla_org | scanners_query_results | utf8mb4_unicode_ci |
| def | addons_mozilla_org | scanners_query_rules | utf8mb4_unicode_ci |
| def | addons_mozilla_org | scanners_rules | utf8mb4_unicode_ci |
| def | addons_mozilla_org | scanners_scannermatch | utf8mb4_unicode_ci |
| def | addons_mozilla_org | scanners_scannerquerymatch | utf8mb4_unicode_ci |
| def | addons_mozilla_org | versions_versionreviewerflags | utf8mb4_unicode_ci |
+---------------+--------------------+-------------------------------+--------------------+
14 rows in set (0.001 sec)
I verified this is the same across -dev, -stage and -prod.
Oh, I see, in https://github.com/mozilla/addons/issues/6942 it was meant to only change the database collation, not tables.
https://bugzilla.mozilla.org/show_bug.cgi?id=1479111 mentions both collations and it looks like we had a problem switching existing tables to utf8mb4_unicode_ci:
mysql> ALTER TABLE addons CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, LOCK=SHARED; ERROR 1062 (23000): Duplicate entry 'frame-is-transparent' for key 'slug' mysql> ALTER TABLE addons CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci, LOCK=SHARED; Query OK, 842247 rows affected (1 min 14.49 sec) Records: 842247 Duplicates: 0 Warnings: 0
I had forgotten about this... I thought we had "fixed" the data but apparently not.
A summary of the mess we're in:
- Most of the old tables,
addonsin particular (alsousersbecause ofusername, but that column is less important nowadays and we could work around it), are usingutf8mb4_general_ci. It's the default legacy collation in MySQL, it's a fairly "dumb" collation. - The new tables are using
utf8mb4_unicode_ci. - Under
utf8mb4_unicode_cihowever, some slugs (which have to be unique) that are considered different withutf8mb4_general_ciare considered equal - because that collation supports more replacements - It's worth pointing out that MySQL 8.0 (which we aren't using) supports even more collations and has a different default,
utf8mb4_0900_ai_ciwhich supports more recent unicode versions. Presumably that'd mean even more slugs would be considered duplicate with that collation. - We haven't yet figured out a good way to identify which slugs are problematic and what to do with them - and that's why the old tables have stayed on the old collation. On stage, we have roughly 2.6k slugs with non ascii characters (
SELECT COUNT(*) FROM addons WHERE slug NOT REGEXP '^[A-Za-z0-9\.,@&\(\) \-_]*$';), on prod 13k.
We now have a query that can identify problematic slugs:
SELECT slug, count(*) AS cnt FROM addons WHERE slug IS NOT NULL GROUP BY slug collate utf8mb4_0900_ai_ci HAVING cnt > 1;
(this is the version using utf8mb4_0900_ai_ci on MySQL 8.0, which we are not using yet)
Using this query we found 13 problematic duplicates that we need to handle to move forward.
It can also be used for usernames, though we can leave it aside for now (there are lot more problematic usernames).
This issue has been automatically marked as stale because it has not had recent activity. If you think this bug should stay open, please comment on the issue with further details. Thank you for your contributions.
Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-42
We eventually switched to MySQL 8.0 (in https://github.com/mozilla/addons/issues/8043 and https://bugzilla.mozilla.org/show_bug.cgi?id=1666431) and that means that we should revisit default collations again.
https://github.com/mozilla/addons/issues/15032 is an example of something that can fail or not depending on the collation used, so it's important to harmonize across our environments. Ideally we should be using utf8mb4_0900_ai_ci everywhere unless there is a good reason not to on specific tables/columns.
Also note that new local/CI environments are already different from dev/stage/prod, since they should already be using utf8mb4_0900_ai_ci, where as dev/stage/prod are still on a mix of utf8mb4_unicode_ci/utf8mb4_general_ci.