jethro-pmm
jethro-pmm copied to clipboard
Emojis and 4-byte characters cause Jethro to break
Jethro currently breaks when emojis are entered. For instance, if I try to save SMS text containing '😊' into a Person note, it fails and an error is logged:
SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xF0\x9F\x98\x8A' for column `jethro`.`note_comment`.`contents` at row 1
Line 240 of /srv/www/jethro/2.31.1/include/db_object.class.php
USER: 1
REFERER: https://jethro/?view=_edit_note¬e_type=person¬eid=20649
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36
REQUEST:
Array
(
[view] => _edit_note
[note_type] => person
[noteid] => 20649
[update_note_submitted] => 1
[contents] => All good thanks, looking at getting back very soon
Thanks for reaching out 😊
[status] => pending
[assignee] => 1
[action_date_d] => 24
[action_date_m] => 1
[action_date_y] => 2022
)
The problem is that MySQL's 'utf8' encoding isn't real UTF-8, but a 3-byte subset that doesn't support all characters. The correct encoding to use in MySQL is 'utfmb4'. For more information, see here or here.
When I checked my Jethro database I found the database-level encoding was correct:
MariaDB [jethro]> SELECT default_character_set_name FROM information_schema.SCHEMATA
-> WHERE schema_name = "jethro";
+----------------------------+
| default_character_set_name |
+----------------------------+
| utf8mb4 |
+----------------------------+
but table-level encodings were wrong:
MariaDB [jethro]> SELECT CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APP
LICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "jethro";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
| utf8 |
| utf8 |
| utf8 |
| utf8 |
Fixing this needs two changes:
- The database and table encodings need to be changed from
utf8
toutf8mb4
. - Jethro needs to initiate connections with
utf8mb4
rather thanutf8
, as is currently hardcoded when Jethro constructs the connection dsn: https://github.com/tbar0970/jethro-pmm/blob/269b7b420ec43b7691d6f27bd9077809eed0f57f/include/jethrodb.php#L53
Correspondingly, the fix requires two parts:
- Fix the database encodings. I followed this advice to generate appropriate SQL to upgrade all database tables. The result is in upgrades/2022-upgrade-to-2.32.sql
- Fix the database DSN connection string; done in include/jethrodb.php.
This requires MySQL 5.5.3 and above (released in 2010). I have adjusted the minimum MySQL version in README.md.
Note: I hope changing character sets isn't opening a can of worms. I've only tested on one (mariadb 10.3.32) instance. There may be more places where utf8
needs changing to utf8mb4
. I hope other users can test and confirm that upgrades go smoothly.
Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.
Nice summary of the issue here: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/
Holding off on this for now since on MySQL 5.1 emojis are working fine...
Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.
Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!
I've updated #606 too match this too!
Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.
Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!
That wouldn't be it - the original problem was triggered by pasting emojis into a Jethro note, not by sending anything out.
Hey I'm thinking now would be a good time to merge this. Not sure what "closed by deleting the head repository" means or came from?
Holding off on this for now since on MySQL 5.1 emojis are working fine...
Yes, it was wise to hold off for that reason. The SQL in this PR would corrupt some older databases.
As your experience shows, emojis in MySQL 5.1 do work, if your database encoding is latin1
and you're using ancient PHP (5.3.3) which seems hardcode latin1
as the transfer encoding (ignoring charset=utf8
in include/jethrodb.php
). Say we have a latin1
table, and we ask MySQL to insert the 4 utf8 bytes representing 😊. MySQL goes "close enough lol ¯_(ツ)_/¯" and adds the 4 raw bytes, despite them not being valid latin1
. When asked to retrieve data, MySQL goes 'here's your bytes whatever they are'. Everything just works because PHP and MySQL store and load the raw bytes unaltered.
Now what happens if you take one of these latin1
databases containing utf8 bytes, and run the SQL I proposed in this PR? E.g.:
ALTER TABLE `family_note` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
I suspect that might break your characters. Say you have a '😊' encoded as 4 bytes: MySQL will try to map those 4 utf8 bytes one by one, to whatever unicode characters live at those 3 codepoints.
For latin1
databases, what's needed is a more involved process involving a mysqldump, then search & replace, and import, as documented at https://www.whitesmith.co/blog/latin1-to-utf8/. I have a draft guide written on how to do that.
What about the less severe case, where tables are created as utf8
, and we run SQL to CONVERT CHARACTER SET utf8mb4
? I think that would be safe.
So there's 2 kinds of Jethro databases: those with old latin1
databases that need surgery, and those with utf8
databases that just need converting, and we cannot recommend the same SQL for both cases. Perhaps we'll need a PHP script, upgrades/2023-fixcharsets.php
, which does the right thing.
Nice analysis.
Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?
If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import. And/or ancient emojis can just suffer.
Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?
I don't know, but I'd expect the column-level charset to take precedence. If your columns are utf8
you're probably okay - just run some SQL to convert it (and the columns) to utf8mb4
.
If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import.
Yes, I think that's fine.
I've tweaked my comment above, to point out that 😁 has a 4-byte utf8 representation.
As for how to tell category 1 and category 2 databases apart, 4-byte emojis are a useful shibboleth. Ask your Jethro "can you pronounce 😁"?
A modern Jethro using
utf8
will not be able to store that. utf8
can only store 3-byte unicode, not 4-byte. You can see a utf8
Jethro fall over at https://easyjethro.com.au/demo/.
However an old Jethro, with latin1
database and PHP 5.x will be able to store and display 😁. If the smiley shows, you know you'll need database surgery to convert latin1
to utf8mb4
.
Further diagnostic steps from Jeff:
Try running this query:
SELECT T.table_name, CCSA.character_set_name FROM information_schema.TABLES
T, information_schema.COLLATION_CHARACTER_SET_APPLICABILITY
CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = DATABASE();
If you get lots of 'latin1', but smileys still work, your database is basically corrupt - storing utf8 bytes in latin1 tables. The fix is as described in https://www.whitesmith.co/blog/latin1-to-utf8/, to dump the database with no charset transcoding:
mysqldump --routines --events --no-data --skip-set-charset --default-character-set=latin1 jethro > jethro.sql
Then do a massive search & replace on the .sql dump:
cat jethro.sql | sed -e 's/CHARACTER SET [^ ]+/CHARACTER SET utf8mb4/g'
-e 's/DEFAULT CHARSET=[^;]+;/DEFAULT CHARSET=utf8mb4;/g'
-e 's/SET (character_set_client|character_set_results)( +)= ([^@][^ ]+)([ ;])/SET \1\2= utf8mb4\4/g'
-e 's/SET (collation_connection)( +)= ([^ ]+) /SET \1\2= utf8mb4_bin /g'
-e 's/COLLATE [^ ]+_ci/COLLATE utf8mb4_unicode_ci/g'
-e 's/SET NAMES [^ ]+ /SET NAMES utf8mb4 /g' \
fixedjethro.sql
Then import the fixed SQL.