CssToInlineStyles icon indicating copy to clipboard operation
CssToInlineStyles copied to clipboard

Recent change in #234 munges utf8mb4 bytes (like 🧮)

Open jstanden opened this issue 2 years ago • 5 comments

This commit in #234 changing from mb_convert_encoding() to mb_encode_numericentity() introduces an issue that munges utf8mb4 bytes in a string.

This appears to affect any 4-byte character, like 🧮, but not fewer bytes like ✈️.

The output from $cssToInlineStyles->convert() ends up looking like ð§.

We had to pin back to 2.2.4 in Composer to resolve an issue with outgoing mail that runs through this function.

jstanden avatar Oct 05 '22 00:10 jstanden

@Ugoku This test case should probably be added to the unit tests.

jstanden avatar Oct 05 '22 00:10 jstanden

Here is a comparison of the functions' behaviour: https://3v4l.org/bD68C

mhujer avatar Oct 12 '22 13:10 mhujer

@alexdowad looks like you were the one deprecating the HTML-ENTITIES encoding in mbstring in https://github.com/php/php-src/pull/7594. Could you provide the equivalent code for the old code (mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'))) ? I haven't found any migration documentation related to that documentation, and it looks like the existing replacement attempt is not equivalent.

stof avatar Oct 12 '22 13:10 stof

Hi, @stof. I've never heard of this project before and don't have any context about what you are doing here, but from the sample code which @mhujer provided, it looks like you want to convert codepoints above U+FFFF to HTML numeric entities. If that is the case, then you need to pass mb_encode_numericentity a convmap argument which tells it to do that. Example: https://3v4l.org/sRM3F

alexdowad avatar Oct 12 '22 14:10 alexdowad

@stof Sorry, one more comment, just in case... in the above sample code, I told mb_encode_numericentity to encode all codepoints from U+0080 up to U+1FFFF, but actually, the highest possible legal value for a Unicode codepoint (although it's not currently allocated) is U+10FFFF. Therefore, you probably want to use a convmap of [0x80, 0x10FFFF, 0, 0x1FFFFF]. Please do not set the 4th "bitmask" value in the convmap to 0x10FFFF; it needs to be 0x1FFFFF.

You would do well to include test cases in your test suite for U+007E (ASCII tilda), U+007F (ASCII "delete character"), U+0080 (lowest codepoint which will be converted to HTML entity), U+10FFFF (highest codepoint which will be converted), and U+110000 (illegal). You would also do well to include test cases for various types of invalid UTF-8 strings: strings with over-long code units, strings with continuation bytes appearing outside of a multi-byte character, strings which are truncated so a multi-byte character doesn't have the required number of continuation bytes, etc...

All the best with your project!

alexdowad avatar Oct 13 '22 00:10 alexdowad

@jstanden it seems #234 is indeed to blame. This line was how Symfony "fixed" it, but it breaks certain characters as you say. The fix from @alexdowad works for me, I will submit a PR with this fix and add some tests.

Ugoku avatar Dec 20 '22 09:12 Ugoku

BTW, I am thinking of submitting a PHP RFC so we add another built-in function like mb_encode_numericentity, but with a better API. The convmap argument for mb_encode_numericentity is very confusing and I doubt that more than a handful of users actually need anything other than [0, 0x10FFFF, 0, 0x1FFFFF].

I have half a mind to reach out to Moriyoshi-san and try to find out what that convmap argument was originally intended to be used for.

alexdowad avatar Dec 20 '22 09:12 alexdowad