preg_split not splitting some Unicode characters
From manual page: https://php.net/function.preg-split
The JSON-encoded Unicode characters \ud876\ude54 in a string are not split by preg_split('//u', $str, null, PREG_SPLIT_NO_EMPTY);
These are Unicode surrogate code points. They don't correspond to a valid Unicode character. This is mojibake. You can't split them meaningfully into characters if they are not characters in the first place. What exactly is your expectation here?
From the PCRE docs:
In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum # 9 makes it clear that they should not be.
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode code points with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)
We may want to document that surrogates are not supported; converting to UTF-8 first may yield the desired result.
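To illustrate the "convert to UTF-8 first" suggestion, here is a minimal sketch. It assumes the input really is JSON text containing the escape sequence from the report and that the mbstring extension is available for mb_ord(); the variable names are only illustrative.

```php
<?php
// JSON text containing the escaped UTF-16 surrogate pair from the report.
// Single quotes keep the backslashes literal, so this is the raw JSON.
$json = '"\ud876\ude54"';

// json_decode() combines a valid high/low surrogate pair into a single
// code point and returns it as a UTF-8 encoded PHP string.
$str = json_decode($json);

// With valid UTF-8 input, the /u modifier splits on code point boundaries.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $ch) {
    // mb_ord() requires the mbstring extension (assumed to be loaded).
    printf("U+%X\n", mb_ord($ch, 'UTF-8')); // prints "U+2DA54"
}
```

The pair decodes to one supplementary code point, so preg_split returns a single element here rather than two.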
Yes, the characters may not be meaningful. It is a string of obfuscated data: I have to extract the Unicode code point value of each character, add a constant, and then get the character corresponding to the resulting code point value.
I tried to write a method to deobfuscate the string. In Java it works correctly, but in PHP it fails.
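One way to get Java-like behaviour in PHP is to skip preg_split entirely and work on UTF-16 code units as integers. The following is only a sketch under assumptions: the function name deobfuscate(), the $offset parameter, and the wrap-around at 16 bits (mimicking Java char arithmetic) are hypothetical, since the actual algorithm is not shown in this thread. It also assumes the input is a sequence of \uXXXX escapes and that mbstring is available.

```php
<?php
// Hypothetical helper: shift every \uXXXX code unit by $offset, the way
// Java would when doing arithmetic on individual char values.
function deobfuscate(string $escaped, int $offset): string
{
    // Collect the four hex digits of each \uXXXX escape.
    preg_match_all('/\\\\u([0-9a-fA-F]{4})/', $escaped, $m);

    $units = [];
    foreach ($m[1] as $hex) {
        // Wrap to 16 bits, like a Java (char) cast would.
        $units[] = (hexdec($hex) + $offset) & 0xFFFF;
    }

    // Pack the code units as big-endian UTF-16, then convert to UTF-8.
    // Unpaired surrogates in the result cannot be represented in UTF-8
    // and will be replaced by mbstring's substitute character.
    $utf16 = pack('n*', ...$units);
    return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
}

// Illustrative input only: "IJ" shifted down by one.
echo deobfuscate('\u0049\u004a', -1), "\n"; // prints "HI"
```

Because the arithmetic happens on integers, lone surrogates in the obfuscated input are not a problem; only the final shifted values need to form valid UTF-16.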
Java works with UTF-16; PHP's PCRE works with UTF-8.