oxc icon indicating copy to clipboard operation
oxc copied to clipboard

ast: Incomplete HTML entity handling for `JSXText` and `JSXAttributeValue`

Open leaysgur opened this issue 9 months ago • 10 comments

I found 2 mismatches in the AST of JSX generated by ESTree-compatible parsers such as Acorn(+acorn-jsx).

  • (1) The JSXText node does not have a raw property
  • (2) Neither JSXAttributeValue nor JSXText processes HTML entities

For (1), the solution is simply to add the property. The value should be the same as that of a StringLiteral, using the Span position.

👉🏻 UPDATE: (1) is now fixed in #9641


For (2), the situation is a bit more complex, adjustments are needed for:

  • The value of JSXAttributeValue::StringLiteral
  • The value of JSXText

In these cases, it is necessary to assign the value with HTML entities resolved.

http://facebook.github.io/jsx/#sec-jsx-JSXString-SV

Currently, OXC seems to handle this in a transformer following TypeScript?

However, to output an AST identical to Acorn(+arcorn-jsx), this needs to be performed during the parsing.

Image

leaysgur avatar Mar 11 '25 03:03 leaysgur

@Boshen Do you see a problem with decoding HTML entities in parser? Personally, I think it makes a certain amount of sense - feels similar to how we decode escape sequences in string literals etc.

overlookmotel avatar Mar 13 '25 05:03 overlookmotel

@Boshen Do you see a problem with decoding HTML entities in parser? Personally, I think it makes a certain amount of sense - feels similar to how we decode escape sequences in string literals etc.

acorn and babel decodes that in the parser:

https://github.com/acornjs/acorn-jsx/blob/8ed96d6ddec2065204ba07d924bb2e7bca539ea6/index.js#L220

I aligned our implementation to tsc, which decodes in the transformer:

https://github.com/oxc-project/oxc/blob/f2b0cc1a97a80e821c97fe2a7faabaed376d49b7/crates/oxc_transformer/src/jsx/jsx_impl.rs#L967

https://github.com/oxc-project/oxc/blob/f2b0cc1a97a80e821c97fe2a7faabaed376d49b7/crates/oxc_transformer/src/jsx/jsx_impl.rs#L996-L999

I don't have strong preferences.

Boshen avatar Mar 13 '25 05:03 Boshen

The only downside I can see of decoding in parser is that we'll have to encode again in oxc_codegen when printing JSX. Or can we just print the raw field if it's Some?

But printing JSX (without transforming it) is a bit of a niche use case anyway, as far as I'm aware, so it shouldn't hurt perf in common case.

If JSX is transformed, we don't create any more work, we just move the work. It may even be faster to do it in lexer when we're searching through characters of the source anyway.

overlookmotel avatar Mar 13 '25 06:03 overlookmotel

There is one annoyance. TS-ESLint decodes HTML entities in the value field of JSXText, but it doesn't remove excess whitespace.

AST explorer

So if we want to align with TS-ESLint, we'd need to do HTML entity decoding in parser/lexer, and then whitespace removal in JSX transform. That's a pity, as it'd be more efficient to do both at the same time.

overlookmotel avatar Mar 13 '25 09:03 overlookmotel

I double checked all the existing parsers, all of them decode expect tsc. I'll split it - decode in lexer, whitespace in transformer.

Boshen avatar Mar 19 '25 12:03 Boshen

One question: In codegen, what characters do we need/want to encode as &...;?

I guess shortest legal output is only to encode <, >, {, and }.

overlookmotel avatar Mar 19 '25 17:03 overlookmotel

@overlookmotel I can no longer figure out the proper way of doing this. I'll leave this to you.

You need to move

https://github.com/oxc-project/oxc/blob/c6312915dfb377a0f6aad8d9b4beb04d7eccd780/crates/oxc_transformer/src/jsx/jsx_impl.rs#L996-L1046

to

https://github.com/oxc-project/oxc/blob/c6312915dfb377a0f6aad8d9b4beb04d7eccd780/crates/oxc_parser/src/lexer/jsx.rs#L79-L99

with a new method read_jsx_text.

Boshen avatar Mar 20 '25 04:03 Boshen

One question: In codegen, what characters do we need/want to encode as &...;?

I guess shortest legal output is only to encode <, >, {, and }.

Need to revert all the encodes ... just print raw?

Boshen avatar Mar 20 '25 04:03 Boshen

Personally, I don't think we should just print raw. If you're constructing an AST programatically, same as you don't need to manually escape \n when you create a StringLiteral, I don't think you should have to escape < when creating a JSXText node. Oxc should take care of these details for you, so it's impossible to generate invalid output.

overlookmotel avatar Mar 20 '25 11:03 overlookmotel

#10324 threw up a couple of obscure bugs:

  • <div>a &#xA; b</div> should be transformed to _jsx("div", { children: "a b" }) not _jsx("div", { children: "a \n b" }) (because &#xA; is \n, and line breaks are removed).
  • Surrogate pairs are legal in JSX! e.g. <div>&#xD800;&#xDC00;</div> -> _jsx("div", { children: "𐀀" }).

Should be able to fix these while moving decoding escapes into the lexer.

overlookmotel avatar Apr 08 '25 23:04 overlookmotel