jethro-pmm icon indicating copy to clipboard operation
jethro-pmm copied to clipboard

Emojis and 4-byte characters cause Jethro to break

Open jefft opened this issue 3 years ago • 5 comments

Jethro currently breaks when emojis are entered. For instance, if I try to save SMS text containing '😊' into a Person note, it fails and an error is logged:

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xF0\x9F\x98\x8A' for column `jethro`.`note_comment`.`contents` at row 1
Line 240 of /srv/www/jethro/2.31.1/include/db_object.class.php

USER:       1
REFERER:    https://jethro/?view=_edit_note&note_type=person&noteid=20649
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36

REQUEST:
Array
(
    [view] => _edit_note
    [note_type] => person
    [noteid] => 20649
    [update_note_submitted] => 1
    [contents] => All good thanks, looking at getting back very soon
Thanks for reaching out 😊
    [status] => pending
    [assignee] => 1
    [action_date_d] => 24
    [action_date_m] => 1
    [action_date_y] => 2022
)

The problem is that MySQL's 'utf8' encoding isn't real UTF-8, but a 3-byte subset that doesn't support all characters. The correct encoding to use in MySQL is 'utfmb4'. For more information, see here or here.

When I checked my Jethro database I found the database-level encoding was correct:

MariaDB [jethro]> SELECT default_character_set_name FROM information_schema.SCHEMATA                                                                               
    -> WHERE schema_name = "jethro";                                                                                                                               
                                                                                                                                                                   
+----------------------------+                                                                                                                                     
| default_character_set_name |                                                                                                                                     
+----------------------------+                                                                                                                                     
| utf8mb4                    |                                                                                                                                     
+----------------------------+   

but table-level encodings were wrong:

MariaDB [jethro]> SELECT CCSA.character_set_name FROM information_schema.`TABLES` T,        information_schema.`COLLATION_CHARACTER_SET_APP                        
LICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation   AND T.table_schema = "jethro";                                                                    
+--------------------+                                                                                                                                             
| character_set_name |                                                                                                                                             
+--------------------+                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               | 

Fixing this needs two changes:

  1. The database and table encodings need to be changed from utf8 to utf8mb4.
  2. Jethro needs to initiate connections with utf8mb4 rather than utf8, as is currently hardcoded when Jethro constructs the connection dsn: https://github.com/tbar0970/jethro-pmm/blob/269b7b420ec43b7691d6f27bd9077809eed0f57f/include/jethrodb.php#L53

Correspondingly, the fix requires two parts:

  1. Fix the database encodings. ​I followed this advice to generate appropriate SQL to upgrade all database tables. The result is in upgrades/2022-upgrade-to-2.32.sql
  2. Fix the database DSN connection string; done in include/jethrodb.php.

This requires MySQL 5.5.3 and above (released in 2010). I have adjusted the minimum MySQL version in README.md.

Note: I hope changing character sets isn't opening a can of worms. I've only tested on one (mariadb 10.3.32) instance. There may be more places where utf8 needs changing to utf8mb4. I hope other users can test and confirm that upgrades go smoothly.

jefft avatar Jan 29 '22 04:01 jefft

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

tbar0970 avatar Feb 03 '22 10:02 tbar0970

Nice summary of the issue here: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/

Holding off on this for now since on MySQL 5.1 emojis are working fine...

tbar0970 avatar Mar 17 '22 22:03 tbar0970

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!

vanoudt avatar Sep 05 '22 01:09 vanoudt

I've updated #606 too match this too!

vanoudt avatar Sep 05 '22 01:09 vanoudt

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!

That wouldn't be it - the original problem was triggered by pasting emojis into a Jethro note, not by sending anything out.

tbar0970 avatar Sep 05 '22 04:09 tbar0970

Hey I'm thinking now would be a good time to merge this. Not sure what "closed by deleting the head repository" means or came from?

tbar0970 avatar Mar 07 '23 22:03 tbar0970

Holding off on this for now since on MySQL 5.1 emojis are working fine...

Yes, it was wise to hold off for that reason. The SQL in this PR would corrupt some older databases.

As your experience shows, emojis in MySQL 5.1 do work, if your database encoding is latin1 and you're using ancient PHP (5.3.3) which seems hardcode latin1 as the transfer encoding (ignoring charset=utf8 in include/jethrodb.php). Say we have a latin1 table, and we ask MySQL to insert the 4 utf8 bytes representing 😊. MySQL goes "close enough lol ¯_(ツ)_/¯" and adds the 4 raw bytes, despite them not being valid latin1. When asked to retrieve data, MySQL goes 'here's your bytes whatever they are'. Everything just works because PHP and MySQL store and load the raw bytes unaltered.

Now what happens if you take one of these latin1 databases containing utf8 bytes, and run the SQL I proposed in this PR? E.g.:

ALTER TABLE `family_note` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

I suspect that might break your characters. Say you have a '😊' encoded as 4 bytes: MySQL will try to map those 4 utf8 bytes one by one, to whatever unicode characters live at those 3 codepoints.

For latin1 databases, what's needed is a more involved process involving a mysqldump, then search & replace, and import, as documented at https://www.whitesmith.co/blog/latin1-to-utf8/. I have a draft guide written on how to do that.

What about the less severe case, where tables are created as utf8, and we run SQL to CONVERT CHARACTER SET utf8mb4? I think that would be safe.

So there's 2 kinds of Jethro databases: those with old latin1 databases that need surgery, and those with utf8 databases that just need converting, and we cannot recommend the same SQL for both cases. Perhaps we'll need a PHP script, upgrades/2023-fixcharsets.php, which does the right thing.

jefft avatar Mar 07 '23 23:03 jefft

Nice analysis.

Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?

If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import. And/or ancient emojis can just suffer.

tbar0970 avatar Mar 08 '23 01:03 tbar0970

Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?

I don't know, but I'd expect the column-level charset to take precedence. If your columns are utf8 you're probably okay - just run some SQL to convert it (and the columns) to utf8mb4.

If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import.

Yes, I think that's fine.

I've tweaked my comment above, to point out that 😁 has a 4-byte utf8 representation.

As for how to tell category 1 and category 2 databases apart, 4-byte emojis are a useful shibboleth. Ask your Jethro "can you pronounce 😁"? image A modern Jethro using utf8 will not be able to store that. utf8 can only store 3-byte unicode, not 4-byte. You can see a utf8 Jethro fall over at https://easyjethro.com.au/demo/. However an old Jethro, with latin1 database and PHP 5.x will be able to store and display 😁. If the smiley shows, you know you'll need database surgery to convert latin1 to utf8mb4.

jefft avatar Mar 08 '23 02:03 jefft

Further diagnostic steps from Jeff:

Try running this query: SELECT T.table_name, CCSA.character_set_name FROM information_schema.TABLES T, information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = DATABASE();

If you get lots of 'latin1', but smileys still work, your database is basically corrupt - storing utf8 bytes in latin1 tables. The fix is as described in https://www.whitesmith.co/blog/latin1-to-utf8/, to dump the database with no charset transcoding:

mysqldump --routines --events --no-data --skip-set-charset --default-character-set=latin1 jethro > jethro.sql

Then do a massive search & replace on the .sql dump:

cat jethro.sql | sed -e 's/CHARACTER SET [^ ]+/CHARACTER SET utf8mb4/g'
-e 's/DEFAULT CHARSET=[^;]+;/DEFAULT CHARSET=utf8mb4;/g'
-e 's/SET (character_set_client|character_set_results)( +)= ([^@][^ ]+)([ ;])/SET \1\2= utf8mb4\4/g'
-e 's/SET (collation_connection)( +)= ([^ ]+) /SET \1\2= utf8mb4_bin /g'
-e 's/COLLATE [^ ]+_ci/COLLATE utf8mb4_unicode_ci/g'
-e 's/SET NAMES [^ ]+ /SET NAMES utf8mb4 /g' \

fixedjethro.sql

Then import the fixed SQL.

tbar0970 avatar Apr 12 '23 13:04 tbar0970