Character escaping \unnnn did not work with High Surrogate / Low Surrogate

Open ylazy opened this issue 7 months ago • 4 comments

Hi!

Running this sample code produces incorrect output:

// High Surrogate:
trace("\uD83E".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F
trace("\uD83F".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F

// Low Surrogate:
trace("\uDCDC".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F
trace("\uDCBA".charCodeAt(0).toString(16).toLocaleUpperCase()); // 3F

var c1:String = "\uD83E";
var c2:String = "\uD83F";

trace(c1 == c2); // true

Expected result:

// High Surrogate:
trace("\uD83E".charCodeAt(0).toString(16).toLocaleUpperCase()); // D83E
trace("\uD83F".charCodeAt(0).toString(16).toLocaleUpperCase()); // D83F

// Low Surrogate:
trace("\uDCDC".charCodeAt(0).toString(16).toLocaleUpperCase()); // DCDC
trace("\uDCBA".charCodeAt(0).toString(16).toLocaleUpperCase()); // DCBA

var c1:String = "\uD83E";
var c2:String = "\uD83F";

trace(c1 == c2); // false

Using \unnnn with a high surrogate (code points U+D800 to U+DBFF) or a low surrogate (code points U+DC00 to U+DFFF) results in the question mark character (\x3F) instead:

trace(c1 == "?"); // true

Workaround: build the character with String.fromCharCode (e.g. String.fromCharCode(0xD83E)) instead of the \unnnn escape.

Please check this! Thanks!

ylazy, Apr 10 '25 01:04

Same issue with RegExp:

trace("?".search(/[\uD800-\uDBFF]/)); // 0
trace(String.fromCharCode(0xD800).search(/[\uD800-\uDBFF]/)); // -1

This happens because /[\uD800-\uDBFF]/ is compiled to /[?-?]/, so the pattern matches a literal "?" and not the surrogate.

ylazy, Apr 10 '25 02:04

Looks like the first one may be a compiler issue when it comes across that format, although "\uD83E" arguably isn't a valid string? It does work in JavaScript, though:

console.log("\uD83E".charCodeAt(0));
55358

The second one I think is the same. Looking at the point where we get the 'search' call, the "pattern" string from which we create the regular expression is, as you say, [?-?], and that's what is in the constant pool for the SWF, i.e. it was created badly at compile-time.

We can check the compiler logic for handling these things...

thanks

ajwfrost, Apr 10 '25 06:04

FYI, what you're seeing is the "normal" behaviour in Java: the string "\uD83E" isn't really valid in Java, and if you then call String.getBytes("UTF-8") on it, you get the single ? character back.
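A minimal Java sketch of that behaviour, assuming a standard JDK. The class name LoneSurrogateDemo and the helper encode are made up for illustration and are not part of the AIR compiler:

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    // Hypothetical helper: encode with the JDK's built-in UTF-8 encoder.
    // String.getBytes substitutes the charset's replacement byte ('?')
    // for any character it cannot encode, such as an unpaired surrogate.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] lone = encode("\uD83E");        // unpaired high surrogate
        byte[] pair = encode("\uD83E\uDD80");  // valid pair (U+1F980)

        System.out.println(lone.length);                              // 1
        System.out.println(Integer.toHexString(lone[0]).toUpperCase()); // 3F
        System.out.println(pair.length);                              // 4
    }
}
```

An unpaired surrogate collapses to the single byte 0x3F, while a complete surrogate pair encodes to the usual four UTF-8 bytes. That matches the "3F" output in the original report.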

But to make it work more like JavaScript, we can do some custom encoding into UTF-8 for these cases....
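One way such a custom encoding could look, as a rough sketch only. This is an assumption about the general approach, not the actual fix in the AIR compiler, and the class name LenientUtf8 is hypothetical. It writes an unpaired surrogate as an ordinary three-byte sequence instead of replacing it with ?:

```java
import java.io.ByteArrayOutputStream;

public class LenientUtf8 {
    // Hypothetical encoder: standard UTF-8 for everything, except that an
    // unpaired surrogate is emitted as a 3-byte sequence, not as '?'.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a valid high+low pair into one code point;
            // an unpaired surrogate is returned as its own value.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                // This branch also covers unpaired surrogates (U+D800..U+DFFF).
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode("\uD83E")) {
            System.out.printf("%02X ", b & 0xFF); // ED A0 BE
        }
        System.out.println();
    }
}
```

With this sketch, "\uD83E" and "\uD83F" encode to different byte sequences (ED A0 BE and ED A0 BF), so they would no longer compare equal. Note that strict UTF-8 does not permit encoded surrogates, so this only helps if whatever reads the bytes back accepts such sequences.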

ajwfrost, Apr 10 '25 17:04

A lib that I'm building allows users to log outputs to the Debug Console. I used some RegExp patterns and the outputs may contain Emojis. So because https://github.com/airsdk/Adobe-Runtime-Support/issues/3735 exists, I must find a way to escape/unescape surrogate pairs before replacing things with RegExp. And because this issue exists, I also can't use RegExp to escape/unescape surrogate pairs. So I have to use String to search/replace and the performance is worse.

Anyway, I did it:

[Image]

ylazy, Apr 15 '25 10:04