duktape
Regexp [\s\S]* not matching emojis
Hi, apologies if this is not a Duktape issue.
I'm actually updating an old plugin for Movian and I'm new to both as a coder.
So, as the title says, after reading some web content using showtime.httpReq(), regex match and exec parsing abruptly stop at the first emoji, as if it didn't match \S. But all the online ECMAScript-compliant regex testers I tried say it should match.
I cannot reproduce the problem with a snippet because I can't figure out how to properly build a string with an emoji. I tried several ways, including codepoint escapes, raw octets and so on, but for some reason d() (I'm not sure whether it's a debug print function of Movian or of Duktape) doesn't print those emojis correctly, whereas it prints the ones from the web just fine to the terminal. I have checked the web source and they don't appear to be escaped; however, pasting the emoji as is into VS Code gives me garbled output as well.
Any pointers please?
Edit: where are the sources for regex?
Can you provide an example URI so I can build a testcase?
The emoji character is likely non-BMP, and ECMAScript sees it as two 16-bit code units (a surrogate pair). It would match when using the Unicode flag but Duktape doesn't support that yet.
RegExp matcher is here: https://github.com/svaarala/duktape/blob/master/src-input/duk_regexp_executor.c.
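To illustrate the surrogate pair point, here is a minimal sketch of one way to build a test string containing an emoji without depending on the source file's encoding. It is plain ES5 and only shows what a spec-compliant engine does. The choice of U+1F600 and the variable names are assumptions for illustration, and `console.log` may need to be replaced with whatever print function the host provides. Whether Duktape gives the same result for text fetched from the web depends on how that string was pushed into the engine.

```javascript
// U+1F600 written as an explicit surrogate pair (illustrative choice of emoji).
var emoji = '\uD83D\uDE00';

// The same string built from the two 16-bit code units.
var sameEmoji = String.fromCharCode(0xD83D, 0xDE00);

var text = 'before ' + emoji + ' after';

// Without the Unicode flag, [\s\S] matches one code unit at a time,
// so each half of the surrogate pair is matched separately.
var match = /[\s\S]*/.exec(text);

console.log(emoji === sameEmoji);   // true
console.log(emoji.length);          // 2
console.log(match[0] === text);     // true
```

If the same pattern stops early on a string that came from the network, comparing `length` and `charCodeAt()` values around the emoji against this hand-built string should show whether the two strings are represented differently.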
Also worth noting that in Duktape 2.x when you push a string it is accepted "as is". In particular this means:
- If the input UTF-8 contains non-BMP UTF-8 codepoints they will be accepted as is into the internal representation. ECMAScript code will see them as non-standard non-BMP codepoints (and not surrogate pairs).
- If the input contains non-BMP surrogate pairs encoded as separate codepoints (CESU-8 / WTF-8 style), they are also accepted as is, and ECMAScript will see them as surrogate pairs.
So first off you'd want to decide how you want the string to appear in ECMAScript, and then check if the RegExp works as expected or not. If you want ECMAScript to see surrogate pairs (the second case above), you might need to transcode from UTF-8 to CESU-8/WTF-8 style when the strings are pushed.
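The transcoding step mentioned above can be sketched roughly as follows. This is not Duktape or Movian code: `utf8ToCesu8` and `pushCodeUnit` are hypothetical helpers, written in ES5 on plain arrays of byte values, and they assume the input is already well-formed UTF-8 (no validation is attempted). In practice this conversion would more likely be done on the C side before the string is pushed.

```javascript
// Hypothetical helper: append one 16-bit code unit as a 3-byte sequence.
function pushCodeUnit(out, unit) {
  out.push(0xE0 | (unit >> 12),
           0x80 | ((unit >> 6) & 0x3F),
           0x80 | (unit & 0x3F));
}

// Hypothetical helper: re-encode UTF-8 bytes so that every non-BMP codepoint
// (4-byte sequence) becomes a surrogate pair encoded as two 3-byte sequences.
// Assumes well-formed UTF-8 input; all other bytes are copied unchanged.
function utf8ToCesu8(bytes) {
  var out = [];
  var i = 0;
  while (i < bytes.length) {
    var b = bytes[i];
    if ((b & 0xF8) === 0xF0 && i + 3 < bytes.length) {
      // Decode the 4-byte sequence into a codepoint above U+FFFF.
      var cp = ((b & 0x07) << 18) |
               ((bytes[i + 1] & 0x3F) << 12) |
               ((bytes[i + 2] & 0x3F) << 6) |
               (bytes[i + 3] & 0x3F);
      var v = cp - 0x10000;
      pushCodeUnit(out, 0xD800 + (v >> 10));    // high surrogate
      pushCodeUnit(out, 0xDC00 + (v & 0x3FF));  // low surrogate
      i += 4;
    } else {
      out.push(b);
      i += 1;
    }
  }
  return out;
}

// 'A' followed by U+1F600 in UTF-8.
var input = [0x41, 0xF0, 0x9F, 0x98, 0x80];
var hex = utf8ToCesu8(input).map(function (b) { return b.toString(16); }).join(' ');
console.log(hex);  // 41 ed a0 bd ed b8 80
```

BMP text passes through untouched, so the output only grows by two bytes for each non-BMP character.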
Duktape master has switched to WTF-8 representation which means that Duktape will now always normalize pushed strings to WTF-8. So regardless of which kind of string (in the above bullet list) is pushed, it ends up in WTF-8 i.e. ECMAScript will see surrogate pairs (and C code will always see valid UTF-8, i.e. combined codepoints, whenever possible).