Preg split not splitting some unicodes

Open viraj-bookanna opened this issue 4 years ago • 4 comments

From manual page: https://php.net/function.preg-split


These JSON-encoded Unicode characters in a string are not split by the following call: \ud876\ude54 preg_split('//u', $str, null, PREG_SPLIT_NO_EMPTY);

viraj-bookanna avatar Jun 07 '21 10:06 viraj-bookanna

These are Unicode surrogate code points. They don't correspond to a valid Unicode character. This is mojibake. You can't split them meaningfully into characters if they are not characters in the first place. What exactly is your expectation here?

kamil-tekiela avatar Jun 07 '21 11:06 kamil-tekiela

From the PCRE docs:

In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum # 9 makes it clear that they should not be.

Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode code points with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)

We may want to document that surrogates are not supported; converting to UTF-8 first may yield the desired result.
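A minimal sketch of that suggestion, assuming the input arrives as a JSON string literal containing the `\ud876\ude54` escapes from the report and that the mbstring extension is available for `mb_ord()`. `json_decode()` combines the surrogate pair into one UTF-8 encoded character, which `preg_split()` with the `u` modifier can then handle. The variable names are illustrative only.

```php
<?php
// Assumption: the data is a JSON string literal with \u escapes, as in the report.
// In a single-quoted PHP string the backslashes are kept literally.
$json = '"\ud876\ude54"';

// json_decode() combines the surrogate pair into a single UTF-8 character.
$str = json_decode($json);

// -1 means "no limit"; the u modifier makes PCRE treat $str as UTF-8.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

printf("%d character(s), code point U+%X\n", count($chars), mb_ord($chars[0], 'UTF-8'));
// prints "1 character(s), code point U+2DA54"
```

The pair is reported as one character because U+2DA54 lies above U+FFFF and UTF-16 needs two code units to represent it.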

cmb69 avatar Jun 07 '21 11:06 cmb69

Yes, the characters may not be meaningful. The string is obfuscated data: I have to extract the Unicode code point value of each character, add a constant, and then get the character corresponding to the resulting code point value.

I tried to write a method to deobfuscate a string. In Java it works correctly, but in PHP it fails.
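For illustration, the code point arithmetic described above might look roughly like this in PHP, assuming the string is already valid UTF-8, PHP 7.4 or later, and the mbstring extension. The helper name `shift_code_points` and the offset of 1 are hypothetical and not taken from the original code. Note that this works on whole code points rather than on UTF-16 code units as Java's `char` does, so results for characters above U+FFFF may differ from a Java implementation.

```php
<?php
// Hypothetical helper: shift every code point in a UTF-8 string by $offset.
function shift_code_points(string $str, int $offset): string
{
    $out = '';
    foreach (mb_str_split($str, 1, 'UTF-8') as $ch) {
        $shifted = mb_chr(mb_ord($ch, 'UTF-8') + $offset, 'UTF-8');
        // mb_chr() returns false when the result is not a valid code point
        // (for example a surrogate); keep the original character in that case.
        $out .= $shifted === false ? $ch : $shifted;
    }
    return $out;
}

echo shift_code_points('abc', 1), "\n"; // prints "bcd"
```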

viraj-bookanna avatar Jun 07 '21 12:06 viraj-bookanna

Java works with UTF-16; PHP's PCRE works with UTF-8.
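To make the difference concrete, here is a rough sketch of how the UTF-16 code units that Java's `char` exposes could be recovered in PHP. It assumes the mbstring extension is available and that the string is valid UTF-8; the variable names are illustrative.

```php
<?php
// Assumption: $str is valid UTF-8. U+2DA54 is the character that the
// surrogate pair \ud876\ude54 encodes, written with PHP's \u{...} escape.
$str = "\u{2DA54}";

// Re-encode as big-endian UTF-16, then read it as unsigned 16-bit units.
$utf16 = mb_convert_encoding($str, 'UTF-16BE', 'UTF-8');
$units = array_values(unpack('n*', $utf16));

foreach ($units as $unit) {
    printf("%04x\n", $unit); // prints "d876" then "de54"
}
```

With the `u` modifier, PCRE sees this string as a single character, while Java sees two `char` values, which is why the same per-character logic behaves differently in the two languages.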

cmb69 avatar Jun 07 '21 12:06 cmb69