dbal
dbal copied to clipboard
Fix: #6146 - utf8/utf8mb3 alias causes redundant ALTER TABLE changes
| Q | A |
|---|---|
| Type | bug |
| Fixed issues | #6146 |
Summary
In MySQL (and MariaDB), utf8 and utf8mb3 are aliases. Depending on the database server version, the information schema may contain one or the other. Similarly, column definitions may contain one or the other.
When performing schema-comparisons, a mismatch between utf8 and utf8mb3 causes an ALTER TABLE statement to be generated - even when columns are identical.
The code already contains a function to normalize the charset/collation options: https://github.com/doctrine/dbal/blob/61d79c6e379a39dc1fea6b4e50a23dfc3cd2076a/src/Platforms/MySQL/Comparator.php#L83
My solution is to extend this function so that it also replaces utf8 with utf8mb3 (or vice-versa), to match the current database connection.
To detect the prefered character set for the connection, I've added:
AbstractMySQLPlatform::informationSchemaUsesUtf8mb3()
MariaDBPlatform::informationSchemaUsesUtf8mb3()
The obvious place for the new normalization logic is the CharsetMetadataProvider and CollationMetadataProvider classes. These are interfaces, so this is technically a BC break, however, they are marked @internal.
I've added tests which should cover all the new/modified code. I've run it against MySQL. I don't have MariaDB to hand.
Just pushed a change to fix the psalm issues
It seems there are CI jobs failing. Please take a look at this guide for more on how to handle those.
ಠ_ಠ
I've added tests which should cover all the new/modified code. I've run it against MySQL. I don't have MariaDB to hand.
Codecov says that some of the new code is untested, most notably the parts using version_compare. Please add functional tests for this change. As you can see, our test suite features, among other platforms MariaDB 10.5 and MariaDB 10.6, so you should be able to test that schema comparisons on either version of MariaDB indeed doesn't cause ALTER TABLE statements to be generated. Same goes for MySQL.
You can learn more about such tests at https://www.doctrine-project.org/projects/doctrine-dbal/en/stable/reference/testing.html
Codecov says that some of the new code is untested, most notably the parts using version_compare. Please add functional tests for this change.
Running the tests locally says the code is covered. (Ad hoc testing confirms this).
XDEBUG_MODE=coverage vendor/bin/phpunit -c ci/github/phpunit/pdo_mysql.xml --coverage-html=coverage
However, Codecov says it isn't.
I think I found the reason: [2024-05-01T09:48:28.648Z] ['error'] There was an error running the uploader: Error uploading to [https://codecov.io:](https://codecov.io/) Error: There was an error fetching the storage URL during POST: 429 - {'detail': ErrorDetail(string='Rate limit reached. Please upload with the Codecov repository upload token to resolve issue. Expected available in 1935 seconds.', code='throttled')}
Let me retry this.
From the Codecov UI:
Coverage data is based on head 9c87c57(2 uploads) compared to base 7fb00fd(5 uploads)
… we currently send 69 uploads, so there is no way we are going to be below the upload limit. I will have to look into this separately.
Ah there is one issue though: since this is a fix, you should target 3.8.x
Ah there is one issue though: since this is a fix, you should target 3.8.x
I can create another PR for this branch.
Please don't, we usually do merges up. Instead, let me retarget your PR, and please rebase your commits on 3.8.x and force push.
Please don't, we usually do merges up. Instead, let me retarget your PR, and please rebase your commits on 3.8.x and force push.
The PR uses Connection::getServerVersion() to select the prefered alias for utf8/utf8mb3.
In 3.8.x, this method isn't available, so I'll need to find another approach.
I'm going to be offline for the next week or so, and will look at this on my return.
Enjoy your time off 👍
@fisharebest do you need help with the rebase?
I have rewritten my patch to use the information_schema instead of server version numbers. This seems more robust, and it should also continue to work if MySQL remove utf8mb3 in version 9.
However, I have been unable to back-port this PR to the 3.8.x branch. (I can, but it doesn't fix the bug...).
The handling of charset/collation is different. I'd probably need to copy lots of other changes from the 4.0.x branch to make it work.
Options are:
- apply this patch to
4.0.xonly - attempt to backport the changes in charset/collation handling from
4.0.xto3.8.xand then try again - ????
I'd be surprised if there were lots of changes you could backport that are not breaking changes, so I'd say we should consider applying this to 4.0.x although given the usage statistics of 4.0.x, it won't have much effect at first.
I have rewritten my patch to use the information_schema instead of server version numbers. This seems more robust, and it should also continue to work if MySQL remove utf8mb3 in version 9.
As far as I understand this, utf8 would stay utf8 once utf8mb3 isn't supported anymore. I didn't find information about when this alias handling was introduced on a short search though.
it should also continue to work if MySQL remove utf8mb3 in version 9.
Now that MySQL 9.0 is released, we can see that the logic still works, and the tests pass.
Can you please rebase onto 4.1? That should trigger a CI pipeline that includes MySQL 9.
All the tests are passing. Looks like a missing token in the CI tools.
You can ignore the Codecov failure, that issue is tracked here: https://github.com/codecov/codecov-action/issues/1548
Yes sorry, we lost track of this PR, apparently. I'll do a review of the PR.
@fisharebest Will you have time to pick up the work on this PR? From what I gathered, we don't need the detection logic that you've built at all.
The charsets utf8 and utf8mb3 are the same charsets in MySQL and most likely will be for eternity. MySQL considers one of the identifiers as canonical and the other one as alias. No matter which of the both identifiers we use when creating a table, MySQL will always report the canonical name.
- In MySQL 5.7 and below,
utf8is the canonical name andutf8mb3the alias. - In MySQL 8.0 and above,
utf8mb3is the canonical name andutf8the alias.
This means, that if we normalize both schemas to let's say the long version utf8mb3, the comparison should always succeed.
Note that MySQL 5.7 support is deprecated by now, so I would really like to keep logic that works around behavioral changes of MySQL 8.0 as simple as possible.
- In MySQL 5.7 and below, utf8 is the canonical name and utf8mb3 the alias.
- In MySQL 8.0 and above, utf8mb3 is the canonical name and utf8 the alias.
Not quite. The change in canonical/alias happened between 8.0.29 and 8.0.30. See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html
For MariaDB it is a little more complicated. The same change was made in 10.6.1, but a run-time variable (old_mode) allows the previous canonical/alias to be used. This setting still exists on the latest version of MariaDB.
See https://mariadb.com/kb/en/old-mode/#utf8_is_utf8mb3
This means, that if we normalize both schemas to let's say the long version utf8mb3, the comparison should always succeed.
Yes, but there is still the small issue of DBAL looking up default collations for charsets. See the second part of my comment at https://github.com/doctrine/dbal/issues/6146#issuecomment-2067784678
We'd need to normalise the result of these queries too.
Not quite. The change in canonical/alias happened between 8.0.29 and 8.0.30. See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html
Oh right. MySQL's versioning policy during 8.0.x times was wild.
For MariaDB it is a little more complicated. The same change was made in 10.6.1, but a run-time variable (
old_mode) allows the previous canonical/alias to be used. This setting still exists on the latest version of MariaDB. See https://mariadb.com/kb/en/old-mode/#utf8_is_utf8mb3
WTF. Thank you very much for your thorough research. So, we could set that variable before we run our queries and reset it afterwards, right?
This means, that if we normalize both schemas to let's say the long version utf8mb3, the comparison should always succeed.
Yes, but there is still the small issue of DBAL looking up default collations for charsets. See the second part of my comment at #6146 (comment)
We'd need to normalise the result of these queries too.
Fine by me. But again, if we do all this normalization, we don't need run any detection logic on the server because we enable DBAL to work with MySQL/MariaDB responsing with old and new character set names (and corresponsing collation names), correct?
There hasn't been any activity on this pull request in the past 90 days, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 7 days. If you want to continue working on it, please leave a comment.
This pull request was closed due to inactivity.