CssToInlineStyles Recent change in #234 munges utf8mb4 bytes (like 🧮)

This commit in #234 changing from mb_convert_encoding() to mb_encode_numericentity() introduces an issue that munges utf8mb4 bytes in a string.

This appears to affect any 4-byte character, like 🧮, but not fewer bytes like ✈️.

The output from $cssToInlineStyles->convert() ends up looking like ð§.

We had to pin back to 2.2.4 in Composer to resolve an issue with outgoing mail that runs through this function.

Oct 05 '22 00:10 jstanden

@Ugoku This test case should probably be added to the unit tests.

Oct 05 '22 00:10 jstanden

Here is a comparison of the functions' behaviour: https://3v4l.org/bD68C

Oct 12 '22 13:10 mhujer

@alexdowad looks like you were the one deprecating the HTML-ENTITIES encoding in mbstring in https://github.com/php/php-src/pull/7594. Could you provide the equivalent code for the old code (mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'))) ? I haven't found any migration documentation related to that documentation, and it looks like the existing replacement attempt is not equivalent.

Oct 12 '22 13:10 stof

Hi, @stof. I've never heard of this project before and don't have any context about what you are doing here, but from the sample code which @mhujer provided, it looks like you want to convert codepoints above U+FFFF to HTML numeric entities. If that is the case, then you need to pass mb_encode_numericentity a convmap argument which tells it to do that. Example: https://3v4l.org/sRM3F

Oct 12 '22 14:10 alexdowad

@stof Sorry, one more comment, just in case... in the above sample code, I told mb_encode_numericentity to encode all codepoints from U+0080 up to U+1FFFF, but actually, the highest possible legal value for a Unicode codepoint (although it's not currently allocated) is U+10FFFF. Therefore, you probably want to use a convmap of [0x80, 0x10FFFF, 0, 0x1FFFFF]. Please do not set the 4th "bitmask" value in the convmap to 0x10FFFF; it needs to be 0x1FFFFF.

You would do well to include test cases in your test suite for U+007E (ASCII tilda), U+007F (ASCII "delete character"), U+0080 (lowest codepoint which will be converted to HTML entity), U+10FFFF (highest codepoint which will be converted), and U+110000 (illegal). You would also do well to include test cases for various types of invalid UTF-8 strings: strings with over-long code units, strings with continuation bytes appearing outside of a multi-byte character, strings which are truncated so a multi-byte character doesn't have the required number of continuation bytes, etc...

All the best with your project!

Oct 13 '22 00:10 alexdowad

@jstanden it seems #234 is indeed to blame. This line was how Symfony "fixed" it, but it breaks certain characters as you say. The fix from @alexdowad works for me, I will submit a PR with this fix and add some tests.

Dec 20 '22 09:12 Ugoku

BTW, I am thinking of submitting a PHP RFC so we add another built-in function like mb_encode_numericentity, but with a better API. The convmap argument for mb_encode_numericentity is very confusing and I doubt that more than a handful of users actually need anything other than [0, 0x10FFFF, 0, 0x1FFFFF].

I have half a mind to reach out to Moriyoshi-san and try to find out what that convmap argument was originally intended to be used for.

Dec 20 '22 09:12 alexdowad

CssToInlineStyles CssToInlineStyles copied to clipboard

Recent change in #234 munges utf8mb4 bytes (like 🧮)

CssToInlineStyles
CssToInlineStyles copied to clipboard