codegen: string surrogates printed incorrectly
Related: https://github.com/oxc-project/oxc/issues/3518#issuecomment-2146978752
https://oxc-project.github.io/oxc/playground/?code=3YCAAICQgICAgICAgIC2mcpqpcsDOJx5VHQkvaEBWf7tgIx%2FfaNQgA%3D%3D
We need to spend some time to finish printing strings and numbers:
https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L64
`let a = '\udf06'` is broken because it's a lone surrogate: https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L197
I fully ported `printUnquotedUTF16` before, but it didn't solve the problem.
The oxc parser actually parses `"\udf06"` the same as `"\\udf06"`, so there's no way to distinguish these two strings when printing.
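To make the collapse concrete, here's a minimal hypothetical sketch (not oxc's actual lexer) of an escape handler that falls back to pushing the escape's source characters when the code point is a lone surrogate; both source texts then produce the same value:

```rust
// Hypothetical lexer fallback, illustrating the bug described above:
// when a `\uXXXX` escape encodes a lone surrogate, the raw escape text
// is pushed instead, so `'\udf06'` and `'\\udf06'` become identical.
fn lex_string_naive(raw: &str) -> String {
    let mut out = String::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('\\') => out.push('\\'),
                Some('u') => {
                    let hex: String = (0..4).filter_map(|_| chars.next()).collect();
                    let cp = u32::from_str_radix(&hex, 16).unwrap();
                    match char::try_from(cp) {
                        Ok(ch) => out.push(ch),
                        // Lone surrogate: not a valid `char`, fall back
                        // to the escape's source characters verbatim.
                        Err(_) => {
                            out.push_str("\\u");
                            out.push_str(&hex);
                        }
                    }
                }
                other => out.extend(other),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // Source contents of `'\udf06'` and `'\\udf06'` respectively:
    assert_eq!(lex_string_naive("\\udf06"), "\\udf06");
    assert_eq!(lex_string_naive("\\\\udf06"), "\\udf06");
}
```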
I ran into this issue because I'm working on a JS engine called Nova which uses oxc as its parser, and we can't use the test262 tests dealing with lone surrogates because of this.
I finally understood where the problem is after looking at other implementations written in Rust: https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L147
Our implementation was copied from jsparagus, which hadn't implemented this part yet:
https://github.com/oxc-project/oxc/blob/92ee77487f0fb929f7f3d8098906c519330198a7/crates/oxc_parser/src/lexer/unicode.rs#L239
While parsing `'\\udf06'`, it pushes `\ u d f 0 6` and the string becomes valid UTF-8.
What we need is https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L56:
```rust
impl UTF16CodeUnitsBuffer for Vec<u16> {
    fn push_code_point(&mut self, mut code_point: u32) {
        if let Ok(cp) = code_point.try_into() {
            self.push(cp);
            return;
        }
        code_point -= 0x10000;
        let cu1 = (code_point / 1024 + 0xD800)
            .try_into()
            .expect("decoded an u32 into two u16.");
        let cu2 = (code_point % 1024 + 0xDC00)
            .try_into()
            .expect("decoded an u32 into two u16.");
        self.push(cu1);
        self.push(cu2);
    }
}
```
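As a standalone sanity check of the arithmetic above (this helper is illustrative, not part of boa or oxc, and assumes a supplementary-plane input): U+1F306 should split into the surrogate pair D83C DF06, matching the standard library's UTF-16 encoder:

```rust
// Mirror of the surrogate-pair math in the boa snippet above.
// Assumes `code_point >= 0x10000` (a supplementary-plane character).
fn to_surrogate_pair(code_point: u32) -> (u16, u16) {
    let cp = code_point - 0x10000;
    ((cp / 1024 + 0xD800) as u16, (cp % 1024 + 0xDC00) as u16)
}

fn main() {
    assert_eq!(to_surrogate_pair(0x1F306), (0xD83C, 0xDF06));
    // Cross-check against the standard library's UTF-16 encoder.
    let units: Vec<u16> = '\u{1F306}'.encode_utf16(&mut [0u16; 2]).to_vec();
    assert_eq!(units, vec![0xD83C, 0xDF06]);
}
```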
I'm super confused right now: how should we store these two UTF-16 strings `"\udf06"` and `"\\udf06"` in `StringLiteral::value` as `Vec<u8>`? Should we change `StringLiteral::value` to `Vec<u16>` instead?
@andreubotella Can you help me out a little bit here, as I don't fully grasp UTF-8, UTF-16 and surrogates?
Similarly to how not all `&[u8]` are valid UTF-8 strings, not all `&[u16]` are valid UTF-16 strings. However, this is the way UTF-16 is defined today. In the early days of Unicode, only 2^16 characters were ever planned, and languages like JS essentially used `&[u16]` as a canonical representation of Unicode, where each item was supposed to be a character. Later, Unicode grew, and UTF-16 was changed so it could represent higher characters, but that meant that some `&[u16]` sequences were no longer valid (that is, lone surrogates). And it was too late for languages like JS to forbid them.
What this means is that some JS strings are invalid UTF-16, and for those strings there isn't a valid UTF-8 representation.
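This can be seen directly in Rust's standard library: a paired surrogate sequence decodes fine, a lone surrogate doesn't, and the lossy conversion destroys information:

```rust
fn main() {
    // A proper surrogate pair (D83C DF06) decodes to U+1F306...
    let paired = [0xD83Cu16, 0xDF06];
    assert_eq!(String::from_utf16(&paired).unwrap(), "\u{1F306}");

    // ...but a lone surrogate is invalid UTF-16, so there is no
    // UTF-8 `String` that can represent it.
    let lone = [0xDF06u16];
    assert!(String::from_utf16(&lone).is_err());

    // The lossy variant substitutes U+FFFD, losing the original code unit.
    assert_eq!(String::from_utf16_lossy(&lone), "\u{FFFD}");
}
```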
In Nova we're planning to deal with this by using the wtf8 crate, which gives you a superset of UTF-8 that can also represent lone surrogates. But this crate doesn't give you all of the features `String` and `str` do.
Another possibility would be to keep a valid UTF-8 representation of the string, where you would encode lone surrogates as the Unicode replacement character (`\uFFFD`, �), which is what `String::from_utf16_lossy` does. You would additionally have an optional `Vec<u16>` representation, only for invalid UTF-16 strings. If the `Vec<u16>` representation is `Some`, you would know the `Vec<u8>` representation is lossy.
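A minimal sketch of that second option (the names here are illustrative, not a proposed oxc API): the valid-UTF-16 fast path stores nothing extra, and only broken strings carry their exact code units:

```rust
// Illustrative representation: a lossy UTF-8 value, plus the exact
// UTF-16 code units only when the string is not valid UTF-16.
#[derive(Debug)]
struct StringValue {
    /// Always valid UTF-8; lone surrogates appear as U+FFFD.
    lossy: String,
    /// `Some` only when the source string contained lone surrogates.
    utf16: Option<Vec<u16>>,
}

impl StringValue {
    fn new(units: &[u16]) -> Self {
        match String::from_utf16(units) {
            Ok(s) => Self { lossy: s, utf16: None },
            Err(_) => Self {
                lossy: String::from_utf16_lossy(units),
                utf16: Some(units.to_vec()),
            },
        }
    }
}

fn main() {
    // The common case pays nothing extra.
    assert!(StringValue::new(&[0x68, 0x69]).utf16.is_none());
    // An invalid string keeps its exact code units alongside the lossy value.
    let v = StringValue::new(&[0xDF06]);
    assert_eq!(v.lossy, "\u{FFFD}");
    assert_eq!(v.utf16, Some(vec![0xDF06]));
}
```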
Just to check: this problem only applies to `StringLiteral`s, right? Identifiers can also contain `\u` escapes. Can they be invalid too? Making `StringLiteral::value` a `&[u16]` (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for `BindingIdentifier::name` etc. to accommodate this tiny edge case.
A quick survey of other native implementations:
- esbuild and boa use `[u16]`
- swc doesn't handle surrogates; AST explorer shows `"\\udf06"` and `"\\\u0000udf06"`

If we were implementing a runtime, I'd go as far as changing it to `[u16]` in our AST. But, given that we need to interact with the Rust world for easy string comparisons, I'm in favor of lossy string + `Option<Vec<u16>>`.
> Just to check: This problem only applies to `StringLiteral`s, right? Identifiers can also contain `\u` escapes. Can they be invalid too? Making `StringLiteral::value` a `&[u16]` (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for `BindingIdentifier::name` etc, to accommodate this tiny edge case.
My understanding is that escapes in identifiers can only encode Unicode characters that would otherwise be valid in that position in the identifier. And I think lone surrogates are not valid ID_Start or ID_Continue. So that could still use a regular str.
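Rust encodes this constraint directly: surrogate code points are excluded from `char` entirely, so an identifier escape resolved through `char::from_u32` can never yield a lone surrogate in the first place:

```rust
fn main() {
    // Surrogate code points (0xD800..=0xDFFF) are not Unicode scalar
    // values, so they cannot be a `char` at all.
    assert!(char::from_u32(0xDF06).is_none());
    // An actual supplementary-plane character is fine.
    assert!(char::from_u32(0x1F306).is_some());
}
```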
Yes, from Mathias: https://mathiasbynens.be/notes/javascript-identifiers-es6
Oh thank god! This problem was already annoying, but if it affected identifiers too, it would have been really annoying.
> I'm in favor of lossy string + `Option<Vec<u16>>`.
I don't have a set view on which approach is better, but if going for the extra field, could we make it `Option<Box<&[u16]>>` so it's 8 bytes, rather than 16? (Given that this field will be `None` 99.9% of the time.)
The code here is wrong: https://github.com/oxc-project/oxc/blob/4757376136441ffd2ceb6bed97b94e6351b94b22/crates/oxc_parser/src/lexer/unicode.rs#L126-L143
We need to mark the string as "broken" in the `char::try_from(code_point)` error branch.
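A hypothetical sketch of that fix (names and signature are illustrative, not the actual oxc lexer code): keep pushing the replacement character on the error branch, but record that the literal no longer round-trips:

```rust
// Illustrative escape handler: on a lone surrogate, emit U+FFFD
// *and* flag the literal as lossy so downstream code (e.g. codegen)
// knows `value` no longer matches the source.
fn push_code_point(out: &mut String, code_point: u32, is_lossy: &mut bool) {
    match char::try_from(code_point) {
        Ok(c) => out.push(c),
        Err(_) => {
            // Lone surrogate (0xD800..=0xDFFF): not representable in UTF-8.
            out.push('\u{FFFD}');
            *is_lossy = true;
        }
    }
}

fn main() {
    let (mut s, mut lossy) = (String::new(), false);
    push_code_point(&mut s, 0x41, &mut lossy);   // 'A'
    push_code_point(&mut s, 0xDF06, &mut lossy); // lone surrogate
    assert_eq!(s, "A\u{FFFD}");
    assert!(lossy);
}
```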
I think the serde_json error from deserializing acorn's JSON output is also due to a broken code point: https://github.com/oxc-project/oxc/blob/567bc2c0a298d29c14f9ef5d6170524f77f995c0/tasks/coverage/src/tools/estree.rs#L47-L48
I found a series of issues and PRs by the Deno team on this: https://github.com/serde-rs/json/issues/827, https://github.com/serde-rs/json/pull/830, https://github.com/serde-rs/json/pull/877. But I'm not sure whether serde_json currently allows reading lone surrogates in an easy way.
I'm working on ditching serde_json entirely, because we need a stateful serializer, and that's difficult within serde's framework. serde is fiendishly complicated, but it turns out once you strip it down, remove all the generics, and specialize it for just serializing to JSON, what's left is pretty simple. Work in progress (currently messy, but it'll get better!): https://github.com/oxc-project/oxc/tree/overlookmotel/estree-serializer2
So potentially we can look at fixing it in our serde replacement.
I was wrong. We're not going to be able to solve this at the level of the serializer. Even if serde didn't bug out trying to deserialize the JSON which was generated by serializing Acorn's AST, all these ESTree conformance tests would fail anyway, because Oxc is not parsing these strings correctly in the first place. The problem is that there's no way to represent a lone surrogate (e.g. `\ud834`) as a Rust `&str`, because it's not a valid Unicode character, and `&str` must be valid UTF-8.
For purposes of ESTree, we'll need to ignore these tests until we fix it further up the chain in Oxc.
In #9918, I made the following changes:
- In `StringLiteral`, added the `lossy` field to indicate whether the `value` contains a lossy replacement character (U+FFFD).
- In codegen, print the raw string directly if it is lossy.

I'm not sure if this information is enough for Nova, but I think it can have a separate code path to handle lossy strings.
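Illustratively (this is a sketch, not oxc's actual codegen), the printing rule amounts to: if the value is lossy, re-quoting it would bake U+FFFD into the output, so fall back to the original source slice:

```rust
// Sketch of the codegen rule described above. `raw` is the literal's
// source slice including quotes; `value` is the decoded (possibly
// lossy) string; names are illustrative only.
fn print_string_literal(raw: &str, value: &str, lossy: bool) -> String {
    if lossy {
        // The decoded value is irrecoverably damaged; echo the source.
        raw.to_string()
    } else {
        format!("'{value}'")
    }
}

fn main() {
    assert_eq!(print_string_literal("'\\udf06'", "\u{FFFD}", true), "'\\udf06'");
    assert_eq!(print_string_literal("'hi'", "hi", false), "'hi'");
}
```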
Hi, this wouldn't be enough, because even if we know that there's some lone surrogate that has been converted into the replacement character, there wouldn't be a way to get back the original invalid UTF-16.
Oh sorry, forgot to mention that we now have a `raw` field, or you can slice it from the source text.
Oh, my bad! In that case, I think that would be enough, although we'd have to make sure our parsing of the raw string matches the spec in everything else (escapes, etc...). Thank you so much for working on this!
Most of the code is in https://github.com/oxc-project/oxc/blob/main/crates/oxc_parser/src/lexer/unicode.rs, feel free to copy it over.
I tried to get the Test262/ESTree tests related to lone surrogates etc. to pass after #9918. Some pass, but some don't.
I've not figured out why yet, but will try to. Re-opening in the meantime.
Not fixed yet! But #10041 should do it.