CssToInlineStyles
CssToInlineStyles copied to clipboard
Recent change in #234 munges utf8mb4 bytes (like 🧮)
This commit in #234 changing from mb_convert_encoding()
to mb_encode_numericentity()
introduces an issue that munges utf8mb4 bytes in a string.
This appears to affect any 4-byte character, like 🧮, but not fewer bytes like ✈️.
The output from $cssToInlineStyles->convert()
ends up looking like ð§
.
We had to pin back to 2.2.4 in Composer to resolve an issue with outgoing mail that runs through this function.
@Ugoku This test case should probably be added to the unit tests.
Here is a comparison of the functions' behaviour: https://3v4l.org/bD68C
@alexdowad looks like you were the one deprecating the HTML-ENTITIES
encoding in mbstring in https://github.com/php/php-src/pull/7594. Could you provide the equivalent code for the old code (mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'))
) ? I haven't found any migration documentation related to that documentation, and it looks like the existing replacement attempt is not equivalent.
Hi, @stof. I've never heard of this project before and don't have any context about what you are doing here, but from the sample code which @mhujer provided, it looks like you want to convert codepoints above U+FFFF to HTML numeric entities. If that is the case, then you need to pass mb_encode_numericentity
a convmap argument which tells it to do that. Example: https://3v4l.org/sRM3F
@stof Sorry, one more comment, just in case... in the above sample code, I told mb_encode_numericentity
to encode all codepoints from U+0080 up to U+1FFFF, but actually, the highest possible legal value for a Unicode codepoint (although it's not currently allocated) is U+10FFFF. Therefore, you probably want to use a convmap of [0x80, 0x10FFFF, 0, 0x1FFFFF]
. Please do not set the 4th "bitmask" value in the convmap to 0x10FFFF; it needs to be 0x1FFFFF.
You would do well to include test cases in your test suite for U+007E (ASCII tilda), U+007F (ASCII "delete character"), U+0080 (lowest codepoint which will be converted to HTML entity), U+10FFFF (highest codepoint which will be converted), and U+110000 (illegal). You would also do well to include test cases for various types of invalid UTF-8 strings: strings with over-long code units, strings with continuation bytes appearing outside of a multi-byte character, strings which are truncated so a multi-byte character doesn't have the required number of continuation bytes, etc...
All the best with your project!
@jstanden it seems #234 is indeed to blame. This line was how Symfony "fixed" it, but it breaks certain characters as you say. The fix from @alexdowad works for me, I will submit a PR with this fix and add some tests.
BTW, I am thinking of submitting a PHP RFC so we add another built-in function like mb_encode_numericentity
, but with a better API. The convmap
argument for mb_encode_numericentity
is very confusing and I doubt that more than a handful of users actually need anything other than [0, 0x10FFFF, 0, 0x1FFFFF]
.
I have half a mind to reach out to Moriyoshi-san and try to find out what that convmap
argument was originally intended to be used for.