codegen: string surrogates printed incorrectly
Related: https://github.com/oxc-project/oxc/issues/3518#issuecomment-2146978752
https://oxc-project.github.io/oxc/playground/?code=3YCAAICQgICAgICAgIC2mcpqpcsDOJx5VHQkvaEBWf7tgIx%2FfaNQgA%3D%3D
We need to spend some time to finish printing strings and numbers:
https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L64
`let a = '\udf06'` is broken because it's a lone surrogate: https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L197
I fully ported `printUnquotedUTF16` before, but it didn't solve the problem.
The oxc parser actually parses `"\udf06"` the same as `"\\udf06"`, so there's no way to distinguish these two strings when printing.
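To make the collapse concrete, here's a minimal hypothetical sketch (not oxc's actual lexer) of an escape handler that falls back to pushing the escape's source characters when the code point is a lone surrogate; both source texts then produce the same value:

```rust
// Hypothetical lexer fallback, illustrating the bug described above:
// when a `\uXXXX` escape encodes a lone surrogate, the raw escape text
// is pushed instead, so `'\udf06'` and `'\\udf06'` become identical.
fn lex_string_naive(raw: &str) -> String {
    let mut out = String::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('\\') => out.push('\\'),
                Some('u') => {
                    let hex: String = (0..4).filter_map(|_| chars.next()).collect();
                    let cp = u32::from_str_radix(&hex, 16).unwrap();
                    match char::try_from(cp) {
                        Ok(ch) => out.push(ch),
                        // Lone surrogate: not a valid `char`, fall back
                        // to the escape's source characters verbatim.
                        Err(_) => {
                            out.push_str("\\u");
                            out.push_str(&hex);
                        }
                    }
                }
                other => out.extend(other),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // Source contents of `'\udf06'` and `'\\udf06'` respectively:
    assert_eq!(lex_string_naive("\\udf06"), "\\udf06");
    assert_eq!(lex_string_naive("\\\\udf06"), "\\udf06");
}
```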
I ran into this issue because I'm working on a JS engine called Nova which uses oxc as its parser, and we can't use the test262 tests dealing with lone surrogates because of this.
I finally understood where the problem is after looking at other implementations written in Rust: https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L147
Our implementation was copied from jsparagus, which hadn't implemented this part yet:
https://github.com/oxc-project/oxc/blob/92ee77487f0fb929f7f3d8098906c519330198a7/crates/oxc_parser/src/lexer/unicode.rs#L239
While parsing `'\\udf06'`, it pushes `\ u d f 0 6` and the string becomes valid UTF-8.
What we need is https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L56:
```rust
impl UTF16CodeUnitsBuffer for Vec<u16> {
    fn push_code_point(&mut self, mut code_point: u32) {
        if let Ok(cp) = code_point.try_into() {
            self.push(cp);
            return;
        }
        code_point -= 0x10000;
        let cu1 = (code_point / 1024 + 0xD800)
            .try_into()
            .expect("decoded an u32 into two u16.");
        let cu2 = (code_point % 1024 + 0xDC00)
            .try_into()
            .expect("decoded an u32 into two u16.");
        self.push(cu1);
        self.push(cu2);
    }
}
```
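As a standalone sanity check of the arithmetic above (this helper is illustrative, not part of boa or oxc, and assumes a supplementary-plane input): U+1F306 should split into the surrogate pair D83C DF06, matching the standard library's UTF-16 encoder:

```rust
// Mirror of the surrogate-pair math in the boa snippet above.
// Assumes `code_point >= 0x10000` (a supplementary-plane character).
fn to_surrogate_pair(code_point: u32) -> (u16, u16) {
    let cp = code_point - 0x10000;
    ((cp / 1024 + 0xD800) as u16, (cp % 1024 + 0xDC00) as u16)
}

fn main() {
    assert_eq!(to_surrogate_pair(0x1F306), (0xD83C, 0xDF06));
    // Cross-check against the standard library's UTF-16 encoder.
    let units: Vec<u16> = '\u{1F306}'.encode_utf16(&mut [0u16; 2]).to_vec();
    assert_eq!(units, vec![0xD83C, 0xDF06]);
}
```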
I'm super confused right now: how should we store these two UTF-16 strings `"\udf06"` and `"\\udf06"` in `StringLiteral::value` as `Vec<u8>`? Should we change `StringLiteral::value` to `Vec<u16>` instead?
@andreubotella Can you help me out a little bit here, as I don't fully grasp UTF-8, UTF-16 and surrogates?
Similarly to how not all `&[u8]` are valid UTF-8 strings, not all `&[u16]` are valid UTF-16 strings. However, this is the way UTF-16 is defined today. In the early days of Unicode, only 2^16 characters were ever planned, and languages like JS essentially used `&[u16]` as a canonical representation of Unicode, where each item was supposed to be a character. Later, Unicode grew, and UTF-16 was changed so it could represent higher characters, but that meant that some `&[u16]` sequences were no longer valid (that is, lone surrogates). And it was too late for languages like JS to forbid them.
What this means is that some JS strings are invalid UTF-16, and for those strings there isn't a valid UTF-8 representation.
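This can be seen directly in Rust's standard library: a paired surrogate sequence decodes fine, a lone surrogate doesn't, and the lossy conversion destroys information:

```rust
fn main() {
    // A proper surrogate pair (D83C DF06) decodes to U+1F306...
    let paired = [0xD83Cu16, 0xDF06];
    assert_eq!(String::from_utf16(&paired).unwrap(), "\u{1F306}");

    // ...but a lone surrogate is invalid UTF-16, so there is no
    // UTF-8 `String` that can represent it.
    let lone = [0xDF06u16];
    assert!(String::from_utf16(&lone).is_err());

    // The lossy variant substitutes U+FFFD, losing the original code unit.
    assert_eq!(String::from_utf16_lossy(&lone), "\u{FFFD}");
}
```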
In Nova we're planning to deal with this by using the wtf8 crate, which gives you a superset of UTF-8 that can also represent lone surrogates. But this crate doesn't give you all of the features `String` and `str` do.
Another possibility would be to keep a valid UTF-8 representation of the string, where you would encode lone surrogates as the Unicode replacement character (`\uFFFD`, �), which is what `String::from_utf16_lossy` does. You would additionally have an optional `Vec<u16>` representation, only for invalid UTF-16 strings. If the `Vec<u16>` representation is `Some`, you would know the `Vec<u8>` representation is lossy.
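A minimal sketch of that second option (the names here are illustrative, not a proposed oxc API): the valid-UTF-16 fast path stores nothing extra, and only broken strings carry their exact code units:

```rust
// Illustrative representation: a lossy UTF-8 value, plus the exact
// UTF-16 code units only when the string is not valid UTF-16.
#[derive(Debug)]
struct StringValue {
    /// Always valid UTF-8; lone surrogates appear as U+FFFD.
    lossy: String,
    /// `Some` only when the source string contained lone surrogates.
    utf16: Option<Vec<u16>>,
}

impl StringValue {
    fn new(units: &[u16]) -> Self {
        match String::from_utf16(units) {
            Ok(s) => Self { lossy: s, utf16: None },
            Err(_) => Self {
                lossy: String::from_utf16_lossy(units),
                utf16: Some(units.to_vec()),
            },
        }
    }
}

fn main() {
    // The common case pays nothing extra.
    assert!(StringValue::new(&[0x68, 0x69]).utf16.is_none());
    // An invalid string keeps its exact code units alongside the lossy value.
    let v = StringValue::new(&[0xDF06]);
    assert_eq!(v.lossy, "\u{FFFD}");
    assert_eq!(v.utf16, Some(vec![0xDF06]));
}
```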
Just to check: this problem only applies to `StringLiteral`s, right? Identifiers can also contain `\u` escapes. Can they be invalid too? Making `StringLiteral::value` a `&[u16]` (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for `BindingIdentifier::name` etc. to accommodate this tiny edge case.
A quick survey of other native implementations:
- esbuild and boa use `[u16]`
- swc doesn't handle surrogates; AST explorer shows `"\\udf06"` and `"\\\u0000udf06"`

If we were implementing a runtime, I'd go as far as changing it to `[u16]` in our AST. But, given that we need to interact with the Rust world for easy string comparisons, I'm in favor of lossy string + `Option<Vec<u16>>`.
> Just to check: This problem only applies to `StringLiteral`s, right? Identifiers can also contain `\u` escapes. Can they be invalid too? Making `StringLiteral::value` a `&[u16]` (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for `BindingIdentifier::name` etc, to accommodate this tiny edge case.
My understanding is that escapes in identifiers can only encode Unicode characters that would otherwise be valid in that position in the identifier. And I think lone surrogates are not valid ID_Start or ID_Continue. So that could still use a regular str.
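Rust encodes this constraint directly: surrogate code points are excluded from `char` entirely, so an identifier escape resolved through `char::from_u32` can never yield a lone surrogate in the first place:

```rust
fn main() {
    // Surrogate code points (0xD800..=0xDFFF) are not Unicode scalar
    // values, so they cannot be a `char` at all.
    assert!(char::from_u32(0xDF06).is_none());
    // An actual supplementary-plane character is fine.
    assert!(char::from_u32(0x1F306).is_some());
}
```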
Yes, from Mathias: https://mathiasbynens.be/notes/javascript-identifiers-es6
Oh thank god! This problem was already annoying, but if it affected identifiers too, it would have been really annoying.
> I'm in favor of lossy string + `Option<Vec<u16>>`.
I don't have a set view on which approach is better, but if going for the extra field, could we make it `Option<Box<&[u16]>>` so it's 8 bytes, rather than 16? (Given that this field will be `None` 99.9% of the time.)
The code here is wrong: https://github.com/oxc-project/oxc/blob/4757376136441ffd2ceb6bed97b94e6351b94b22/crates/oxc_parser/src/lexer/unicode.rs#L126-L143
We need to mark the string as "broken" in the `char::try_from(code_point)` error branch.
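A hypothetical sketch of that fix (names and signature are illustrative, not the actual oxc lexer code): keep pushing the replacement character on the error branch, but record that the literal no longer round-trips:

```rust
// Illustrative escape handler: on a lone surrogate, emit U+FFFD
// *and* flag the literal as lossy so downstream code (e.g. codegen)
// knows `value` no longer matches the source.
fn push_code_point(out: &mut String, code_point: u32, is_lossy: &mut bool) {
    match char::try_from(code_point) {
        Ok(c) => out.push(c),
        Err(_) => {
            // Lone surrogate (0xD800..=0xDFFF): not representable in UTF-8.
            out.push('\u{FFFD}');
            *is_lossy = true;
        }
    }
}

fn main() {
    let (mut s, mut lossy) = (String::new(), false);
    push_code_point(&mut s, 0x41, &mut lossy);   // 'A'
    push_code_point(&mut s, 0xDF06, &mut lossy); // lone surrogate
    assert_eq!(s, "A\u{FFFD}");
    assert!(lossy);
}
```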
I think the serde_json error from deserializing acorn's JSON output is also due to a broken code point: https://github.com/oxc-project/oxc/blob/567bc2c0a298d29c14f9ef5d6170524f77f995c0/tasks/coverage/src/tools/estree.rs#L47-L48
I found a series of issues and PRs by the Deno team on this: https://github.com/serde-rs/json/issues/827, https://github.com/serde-rs/json/pull/830, https://github.com/serde-rs/json/pull/877. But I'm not sure whether serde_json currently allows reading lone surrogates in an easy way.
I'm working on ditching serde_json entirely, because we need a stateful serializer, and that's difficult within serde's framework. serde is fiendishly complicated, but it turns out once you strip it down, remove all the generics, and specialize it for just serializing to JSON, what's left is pretty simple. Work in progress (currently messy, but it'll get better!): https://github.com/oxc-project/oxc/tree/overlookmotel/estree-serializer2
So potentially we can look at fixing it in our serde replacement.
I was wrong. We're not going to be able to solve this at the level of the serializer. Even if serde didn't bug out trying to deserialize the JSON which was generated by serializing Acorn's AST, all these ESTree conformance tests would fail anyway, because Oxc is not parsing these strings correctly in the first place. The problem is that there's no way to represent a lone surrogate (e.g. `\ud834`) as a Rust `&str`, because it's not a valid Unicode character, and `&str` must be valid UTF-8.
For purposes of ESTree, we'll need to ignore these tests until we fix it further up the chain in Oxc.
In #9918, I made the following changes:
- In `StringLiteral`, added the `lossy` field to indicate whether the `value` contains a lossy replacement character (U+FFFD).
- In codegen, print the raw string directly if it is lossy.

I'm not sure if this information is enough for Nova, but I think it can have a separate code path to handle lossy strings.
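Illustratively (this is a sketch, not oxc's actual codegen), the printing rule amounts to: if the value is lossy, re-quoting it would bake U+FFFD into the output, so fall back to the original source slice:

```rust
// Sketch of the codegen rule described above. `raw` is the literal's
// source slice including quotes; `value` is the decoded (possibly
// lossy) string; names are illustrative only.
fn print_string_literal(raw: &str, value: &str, lossy: bool) -> String {
    if lossy {
        // The decoded value is irrecoverably damaged; echo the source.
        raw.to_string()
    } else {
        format!("'{value}'")
    }
}

fn main() {
    assert_eq!(print_string_literal("'\\udf06'", "\u{FFFD}", true), "'\\udf06'");
    assert_eq!(print_string_literal("'hi'", "hi", false), "'hi'");
}
```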
Hi, this wouldn't be enough, because even if we know that there's some lone surrogate that has been converted into the replacement character, there wouldn't be a way to get back the original invalid UTF-16.
Oh sorry, forgot to mention that we now have a `raw` field, or you can slice it from the source text.
Oh, my bad! In that case, I think that would be enough, although we'd have to make sure our parsing of the raw string matches the spec in everything else (escapes, etc...). Thank you so much for working on this!
Most of the code is in https://github.com/oxc-project/oxc/blob/main/crates/oxc_parser/src/lexer/unicode.rs, feel free to copy it over.
I tried to get the Test262/ESTree tests related to lone surrogates etc. to pass after #9918. Some pass, but some don't.
I've not figured out why yet, but will try to. Re-opening in the meantime.
Not fixed yet! But #10041 should do it.