SMF icon indicating copy to clipboard operation
SMF copied to clipboard

Mysql mb4

Open sbulen opened this issue 3 years ago • 44 comments

Partial for https://github.com/SimpleMachines/SMF/issues/7173 Fixes #6405 Fixes #7776 Fixes #5108

Overview...
Mysql charsets supported should be utf8 (for 2.1 users migrated before utf8mb4 support) and utf8mb4. The only difference between the two is entity encoding needed for 4-byte utf8 chars when not yet utf8mb4. pg supports mb4 already.

All conditional utf8 logic (utf8 vs iso-8859) is removed.

Current thinking:

  • Install/upgrade to utf8mb4
  • Needs new 'convert to mb4', like 2.0 had with utf8
  • Yes, db changes are necessary, index size constraints (*** Can be avoided if we have change min mysql version to 5.7 & mandate InnoDB...***)
  • Stripped non-utf8/utf8mb4 logic wherever found; assume utf8/utf8mb4 ONLY going forward in 2.1; Making it easier to refactor down the road

191...

  • Where does 191 come from? (int) 767/4... Index size max varies based on innodb vs myisam & the innodb row format, which changed over time... Some configs/versions, the max is 767, for some, it's 1000, & for the rest is 3072. 767 is a safe assumption.
  • To do a good a/b test, you should emulate a prefix length of 767. You can do this by using innodb tables & changing innodb_default_row_format to 'compact', which was the default prior to mysql 5.7. (Or, use MyISAM, which has a limit of 1000.)

Trying to keep field meanings straight...

  • global_character_set - used in http headers "UTF-8"; had been used as a flag whether utf8 conversion had been run (no change)
  • $txt['lang_character_set'] - used in http headers "UTF-8" (no change)
  • $context['character_set'] = derived from the above 2 settings, used in http headers (no change)
  • $db_character_set = "utf8" or "utf8mb4" (requires changes throughout...)
  • The above 4 really need to be in sync...
  • $utf8 & $context['utf8'] = flags used throughout code to determine utf8/utf8mb4 string support - most often used to add u modifier to regexes, derived from above (NO LONGER NEEDED requires changes throughout...)
  • $db_mb4 = flag for whether mb4 is supported; set after utf8mb4 conversion (or pg in use)... Only used in entity conversion. This really should derived in load.php, not set by user.

Read up on index prefixes here: https://dev.mysql.com/doc/refman/8.0/en/column-indexes.html https://dev.mysql.com/doc/refman/8.0/en/innodb-limits.html#:~:text=The%20index%20key%20prefix%20length%20limit%20is%20767%20bytes%20for,4%20bytes%20for%20each%20character.

mb4 support history: https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

Mysql usage & support by version: https://www.eversql.com/mysql-8-adoption-usage-rate/ https://fromdual.com/support-for-mysql-from-oracle

sbulen avatar Dec 14 '20 17:12 sbulen

In my eyes would be this a huge uplift to redurce the complexity, by force everthing to be utf8mb4 AND to raise the min mysql version to 5.5.

So I like the pr.

albertlast avatar Dec 15 '20 10:12 albertlast

I see that in your changes are using utf8mb4_general_ci and I would like to propose using utf8mb4_unicode_ci. "general" does not use the unicode rules for sorting and may have unexpected results when used with various languages.

The added performance of using general over unicode makes little difference with modern CPU & memory power/size.

Underdog-01 avatar Jan 18 '21 02:01 Underdog-01

Also text fields may be affected in some circumstances. They may require being changed to their upper tiered counterpart. (ie. text to mediumtext)

Underdog-01 avatar Jan 18 '21 02:01 Underdog-01

Also text fields may be affected in some circumstances. They may require being changed to their upper tiered counterpart. (ie. text to mediumtext)

Pretty sure we're good here... The settings are now by characters, not bytes.

The problems experienced db-structure-wise were the index lengths...

sbulen avatar Jan 18 '21 03:01 sbulen

I see that in your changes are using utf8mb4_general_ci and I would like to propose using utf8mb4_unicode_ci. "general" does not use the unicode rules for sorting and may have unexpected results when used with various languages.

The added performance of using general over unicode makes little difference with modern CPU & memory power/size.

I did some tests & couldn't find an example where there was a difference.

Can you identify a scenario?

I understand the benefit in theory, just not sure it's really there...

sbulen avatar Jan 18 '21 03:01 sbulen

& THANK YOU for the feedback. I think this is an important topic.

sbulen avatar Jan 18 '21 03:01 sbulen

It's data sorting for various languages that will present obstacles when using general. A brief explanation with some examples can be found on the MySQL forum: https://forums.mysql.com/read.php?103,187048,188748

Underdog-01 avatar Jan 18 '21 04:01 Underdog-01

BTW I agree that this is a very important topic. The MySQL website states that utf8mb3 is deprecated and in the very near future will be removed Look at the first "Note" here: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

Underdog-01 avatar Jan 18 '21 04:01 Underdog-01

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

jdarwood007 avatar Mar 21 '21 00:03 jdarwood007

Why can't this repo contin future versions?

On Sat, Mar 20, 2021 at 5:46 PM Jeremy D @.***> wrote:

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/SimpleMachines/SMF2.1/pull/6409#issuecomment-803492220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADJNN4NBBOBOZGHKW5ZPMLTEU6XFANCNFSM4U3DTNPA .

live627 avatar Mar 21 '21 05:03 live627

in my eyes smf 2.1 should run only in mb4 mode and should drop mb3 support, since it get again mysql and pg similiar and you don't had the struggle with newer mysql version where mb4 get the new default coalation when you define a column as utf8.

albertlast avatar Mar 21 '21 07:03 albertlast

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

I disagree with you on this. SMF2.1 should be mb4, IMO. But we've waited too long & we are appropriately maintaining scope. This is appropriate for 2.1.x.

There is nothing different at all in how we handle the db layer, no changes in DB calls at all. This PR mirrors how we dealt with mb3 in 2.0.

This PR is about 10% fixing bugs, 70% cleaning up old latin1 logic that should have already been removed from 2.1, and 20% migrating to mb4. But we agreed to wait after 2.1, so we are.

sbulen avatar Mar 22 '21 04:03 sbulen

Well I do disagree with that because of all the alterations to the core functionality of code. But I will stand by my statement made on others. This has to make RC4 or its not making 2.1. We can't introduce this between RC4 and final.

jdarwood007 avatar Mar 22 '21 23:03 jdarwood007

Fyi, this is also showing as a draft PR.

jdarwood007 avatar Mar 22 '21 23:03 jdarwood007

It's marked as a draft because the dev team had agreed to wait until after 2.1 goes live for mb4 support. I want to ensure it doesnt get merged prematurely.

sbulen avatar Mar 23 '21 01:03 sbulen

I brought this current, in case folks wanted to test. I incorporated most feedback above. This required a rebase due to many layers of conflicts.

I have not yet responded to @live627's feedback on changes interfering with postgresql. In my testing, it hasn't - this works fine in pg as-is. Still trying to understand why, though...

sbulen avatar Sep 24 '21 02:09 sbulen

I brought this current, in case folks wanted to test. I incorporated most feedback above. This required a rebase due to many layers of conflicts.

I have not yet responded to @live627's feedback on changes interfering with postgresql. In my testing, it hasn't - this works fine in pg as-is. Still trying to understand why, though...

Exists any issue with pg?

albertlast avatar Sep 24 '21 03:09 albertlast

I haven't seen any issues with pg yet. Still early in testing, though.

sbulen avatar Sep 24 '21 03:09 sbulen

From my pov the whole topic didn't exists in pg.

albertlast avatar Sep 24 '21 04:09 albertlast

Whst is with "Convert HTML-entities to UTF-8 characters" is part of the migration or had the user run this by hisself after the mb4?

albertlast avatar Sep 25 '21 04:09 albertlast

Whst is with "Convert HTML-entities to UTF-8 characters" is part of the migration or had the user run this by hisself after the mb4?

It's not part of the migration. It still needs to be run after the upgrade - at least once - for the real characters to be used.

sbulen avatar Sep 25 '21 19:09 sbulen

From where they know, that they had to doit?

albertlast avatar Sep 25 '21 20:09 albertlast

From where they know, that they had to doit?

Good point. (Just like today...) This would be a good thing to add to the online wiki documentation. lt's already in there for the 2.0 utf8 conversion process, but not for 2.1 migrations, which do one for you.

sbulen avatar Sep 25 '21 21:09 sbulen

Brought this one current again.

sbulen avatar Jan 25 '22 03:01 sbulen

Brought this one current again.

sbulen avatar Feb 27 '22 01:02 sbulen

Brought current again!

sbulen avatar Jul 09 '23 04:07 sbulen

I believe I have addressed all outstanding requests. If not, let me know...

sbulen avatar Jul 09 '23 04:07 sbulen

Smf.org would be a nice place 😀

albertlast avatar Jul 09 '23 16:07 albertlast

I did a ton of testing, but like, 2.5 years ago...

It needs to be repeated with the current codebase.

sbulen avatar Jul 09 '23 16:07 sbulen

@albertlast Of course we will, once we feel its stable enough for a production environment. We eat our own food.

jdarwood007 avatar Jul 09 '23 17:07 jdarwood007