ast: Incomplete HTML entity handling for `JSXText` and `JSXAttributeValue`
I found 2 mismatches in the AST of JSX generated by ESTree-compatible parsers such as Acorn(+acorn-jsx).
- (1) The
JSXTextnode does not have arawproperty - (2) Neither
JSXAttributeValuenorJSXTextprocesses HTML entities
For (1), the solution is simply to add the property. The value should be the same as that of a StringLiteral, using the Span position.
👉🏻 UPDATE: (1) is now fixed in #9641
For (2), the situation is a bit more complex, adjustments are needed for:
- The
valueofJSXAttributeValue::StringLiteral - The
valueofJSXText
In these cases, it is necessary to assign the value with HTML entities resolved.
http://facebook.github.io/jsx/#sec-jsx-JSXString-SV
Currently, OXC seems to handle this in a transformer following TypeScript?
However, to output an AST identical to Acorn(+arcorn-jsx), this needs to be performed during the parsing.
@Boshen Do you see a problem with decoding HTML entities in parser? Personally, I think it makes a certain amount of sense - feels similar to how we decode escape sequences in string literals etc.
@Boshen Do you see a problem with decoding HTML entities in parser? Personally, I think it makes a certain amount of sense - feels similar to how we decode escape sequences in string literals etc.
acorn and babel decodes that in the parser:
https://github.com/acornjs/acorn-jsx/blob/8ed96d6ddec2065204ba07d924bb2e7bca539ea6/index.js#L220
I aligned our implementation to tsc, which decodes in the transformer:
https://github.com/oxc-project/oxc/blob/f2b0cc1a97a80e821c97fe2a7faabaed376d49b7/crates/oxc_transformer/src/jsx/jsx_impl.rs#L967
https://github.com/oxc-project/oxc/blob/f2b0cc1a97a80e821c97fe2a7faabaed376d49b7/crates/oxc_transformer/src/jsx/jsx_impl.rs#L996-L999
I don't have strong preferences.
The only downside I can see of decoding in parser is that we'll have to encode again in oxc_codegen when printing JSX. Or can we just print the raw field if it's Some?
But printing JSX (without transforming it) is a bit of a niche use case anyway, as far as I'm aware, so it shouldn't hurt perf in common case.
If JSX is transformed, we don't create any more work, we just move the work. It may even be faster to do it in lexer when we're searching through characters of the source anyway.
There is one annoyance. TS-ESLint decodes HTML entities in the value field of JSXText, but it doesn't remove excess whitespace.
So if we want to align with TS-ESLint, we'd need to do HTML entity decoding in parser/lexer, and then whitespace removal in JSX transform. That's a pity, as it'd be more efficient to do both at the same time.
I double checked all the existing parsers, all of them decode expect tsc. I'll split it - decode in lexer, whitespace in transformer.
One question: In codegen, what characters do we need/want to encode as &...;?
I guess shortest legal output is only to encode <, >, {, and }.
@overlookmotel I can no longer figure out the proper way of doing this. I'll leave this to you.
You need to move
https://github.com/oxc-project/oxc/blob/c6312915dfb377a0f6aad8d9b4beb04d7eccd780/crates/oxc_transformer/src/jsx/jsx_impl.rs#L996-L1046
to
https://github.com/oxc-project/oxc/blob/c6312915dfb377a0f6aad8d9b4beb04d7eccd780/crates/oxc_parser/src/lexer/jsx.rs#L79-L99
with a new method read_jsx_text.
One question: In codegen, what characters do we need/want to encode as
&...;?I guess shortest legal output is only to encode
<,>,{, and}.
Need to revert all the encodes ... just print raw?
Personally, I don't think we should just print raw. If you're constructing an AST programatically, same as you don't need to manually escape \n when you create a StringLiteral, I don't think you should have to escape < when creating a JSXText node. Oxc should take care of these details for you, so it's impossible to generate invalid output.
#10324 threw up a couple of obscure bugs:
<div>a 
 b</div>should be transformed to_jsx("div", { children: "a b" })not_jsx("div", { children: "a \n b" })(because
is\n, and line breaks are removed).- Surrogate pairs are legal in JSX! e.g.
<div>��</div>->_jsx("div", { children: "𐀀" }).
Should be able to fix these while moving decoding escapes into the lexer.