Adobe-Runtime-Support
Character escaping \unnnn does not work with High Surrogate / Low Surrogate
Hi!
Running this sample code, you can see the incorrect outputs:
// High Surrogate:
trace("\uD83E".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F
trace("\uD83F".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F
// Low Surrogate:
trace("\uDCDC".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F
trace("\uDCBA".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F
var c1:String = "\uD83E";
var c2:String = "\uD83F";
trace(c1 == c2); // true
Expected result:
// High Surrogate:
trace("\uD83E".charCodeAt(0).toString(16).toLocaleUpperCase()); // D83E
trace("\uD83F".charCodeAt(0).toString(16).toLocaleUpperCase()); // D83F
// Low Surrogate:
trace("\uDCDC".charCodeAt(0).toString(16).toLocaleUpperCase()); // DCDC
trace("\uDCBA".charCodeAt(0).toString(16).toLocaleUpperCase()); // DCBA
var c1:String = "\uD83E";
var c2:String = "\uD83F";
trace(c1 == c2); // false
Using \unnnn with a High Surrogate (code points from U+D800 to U+DBFF) or a Low Surrogate (code points from U+DC00 to U+DFFF) results in the question mark character (\x3F):
trace(c1 == "?"); // true
Workaround: use String.fromCharCode, e.g. String.fromCharCode(0xD83E) instead of "\uD83E".
Please check this! Thanks!
Same issue with RegExp:
trace("?".search(/[\uD800-\uDBFF]/)); // 0
trace(String.fromCharCode(0xD800).search(/[\uD800-\uDBFF]/)); // -1
This is because /[\uD800-\uDBFF]/ is compiled to /[?-?]/.
The first one looks like a compiler issue in how it handles that escape format, although strictly speaking "\uD83E" is not a valid string on its own (it is an unpaired surrogate).
It does work in JavaScript, though:
console.log("\uD83E".charCodeAt(0));
55358
I think the second one has the same cause. By the time we get the 'search' call, the "pattern" string from which we create the regular expression is already [?-?], as you say. That string sits in the SWF's constant pool, i.e. it was created badly at compile time.
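To illustrate how a pattern could end up as [?-?], here is a rough Java sketch. It assumes the compiler passes string constants through Java's standard UTF-8 encoder, which has not been confirmed against the actual compiler source. `RegexRoundTripDemo`, `roundTrip` and `search` are hypothetical names made up for this example, and `search` only mimics the index-or-minus-one result of ActionScript's String.search.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexRoundTripDemo {
    // Encode to UTF-8 and decode again, as a compiler writing a constant pool might do.
    // Unpaired surrogates are replaced with '?' by the standard encoder.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Return the index of the first match, or -1 if there is none.
    static int search(String input, String patternSource) {
        Matcher m = Pattern.compile(patternSource).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String compiled = roundTrip("[\uD800-\uDBFF]");
        System.out.println(compiled);                    // [?-?]
        System.out.println(search("?", compiled));       // 0
        System.out.println(search("\uD800", compiled));  // -1
    }
}
```

Under that assumption, the two results match what the report shows: the damaged pattern matches a literal question mark and no longer matches a real high surrogate.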
We can check the compiler logic for handling these things...
thanks
FYI, what you're seeing is the "normal" behaviour in Java: the string "\uD83E" isn't really valid there, and if you call getBytes("UTF-8") on it you get the single ? character back.
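A small Java sketch of that behaviour, for anyone who wants to reproduce it outside the compiler. `LoneSurrogateDemo` and `utf8Hex` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class LoneSurrogateDemo {
    // Encode a string with the standard UTF-8 encoder and return the bytes as uppercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\uD83E"));        // unpaired high surrogate -> 3F
        System.out.println(utf8Hex("\uDCDC"));        // unpaired low surrogate  -> 3F
        System.out.println(utf8Hex("\uD83E\uDD14"));  // valid pair (U+1F914)    -> F09FA494
    }
}
```

Only unpaired surrogates are affected. A complete high/low pair encodes to the usual four UTF-8 bytes.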
But to make it work more like JavaScript, we can do some custom encoding into UTF-8 for these cases....
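One way such custom encoding could look is sketched below. This is not the actual compiler change, only an assumption about the general approach: unpaired surrogates are written as ordinary three-byte sequences (sometimes called generalized UTF-8 or WTF-8) instead of being replaced with '?'. `LenientUtf8`, `encode` and `hex` are hypothetical names, and whether the runtime decodes these bytes back into the original code unit would still need to be checked.

```java
import java.io.ByteArrayOutputStream;

class LenientUtf8 {
    // Encode a string to UTF-8, keeping unpaired surrogates as three-byte sequences.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines valid pairs and returns an unpaired surrogate unchanged.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                // This branch also covers unpaired surrogates U+D800..U+DFFF.
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Format bytes as uppercase hex for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("\uD83E")));        // EDA0BE
        System.out.println(hex(encode("\uDCDC")));        // EDB39C
        System.out.println(hex(encode("\uD83E\uDD14")));  // F09FA494
    }
}
```

Valid pairs still come out as standard four-byte UTF-8, so only the previously lossy cases change.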
A lib that I'm building lets users log output to the Debug Console. It uses some RegExp patterns, and the output may contain emojis. Because of https://github.com/airsdk/Adobe-Runtime-Support/issues/3735, I have to escape/unescape surrogate pairs before doing any RegExp replacement. And because of this issue, I can't use RegExp to do that escaping/unescaping either. So I have to fall back to String search/replace, which performs worse.
Anyway I did it: