grammars-v4
grammars-v4 copied to clipboard
JSX parser issue with nested elements
Hello,
Thanks for the great project! Thanks @mike-lischke for your work on the JSX parser.
I'm trying to integrate the JSX parser and I think I've found an issue when using nested elements/components on a single line. The following works fine:
function Function(props) {
return (<div></div>);
}
but if I try to parse the following:
function Function(props) {
return (<div><div></div></div>);
}
I get the following errors:
line 2:39 no viable alternative at input 'function Function(props) {\n return (<div><div>hello</div></div>)'
line 1:25 no viable alternative at input '{'
line 2:11 no viable alternative at input '('
line 2:39 mismatched input ')' expecting {'[', '(', '{', '++', '--', '+', '-', '~', '!', '<', 'null', BooleanLiteral, DecimalLiteral, HexIntegerLiteral, OctalIntegerLiteral, OctalIntegerLiteral2, BinaryIntegerLiteral, BigHexIntegerLiteral, BigOctalIntegerLiteral, BigBinaryIntegerLiteral, BigDecimalIntegerLiteral, 'typeof', 'new', 'void', 'function', 'this', 'delete', 'class', 'super', 'import', 'async', 'await', NonStrictLet, 'yield', Identifier, StringLiteral, BackTick, RegularExpressionLiteral}
I think the problem is that it requires a newline character after elements. The following works:
function Function(props) {
return (<div>
<div>
</div>
</div>);
}
Any idea how we can modify the parser to handle the case where there is no newline? Thanks.
When printing the tokens that are being created for the input
function Function(props) {
return (<div><div></div></div>);
}
I get this:
1 Function_ 'function'
2 WhiteSpaces ' '
3 Identifier 'Function'
4 OpenParen '('
5 Identifier 'props'
6 CloseParen ')'
7 WhiteSpaces ' '
8 OpenBrace '{'
9 LineTerminator '\n'
10 WhiteSpaces ' '
11 Return 'return'
12 WhiteSpaces ' '
13 OpenParen '('
14 LessThan '<'
15 Identifier 'div'
16 MoreThan '>'
17 LessThan '<'
18 Identifier 'div'
19 MoreThan '>'
20 LessThan '<'
21 RegularExpressionLiteral '/div></div'
22 MoreThan '>'
23 CloseParen ')'
24 SemiColon ';'
25 LineTerminator '\n'
26 CloseBrace '}'
27 LineTerminator '\n'
28 EOF '<EOF>'
So the issue is that the lexer tokenizes the input /div></div
as a RegularExpressionLiteral
. And when there is a line break between the /
s, it won't become a RegularExpressionLiteral
:
1 Function_ 'function'
2 WhiteSpaces ' '
3 Identifier 'Function'
4 OpenParen '('
5 Identifier 'props'
6 CloseParen ')'
7 WhiteSpaces ' '
8 OpenBrace '{'
9 LineTerminator '\n'
10 WhiteSpaces ' '
11 Return 'return'
12 WhiteSpaces ' '
13 OpenParen '('
14 LessThan '<'
15 Identifier 'div'
16 MoreThan '>'
17 LessThan '<'
18 Identifier 'div'
19 MoreThan '>'
20 LessThan '<'
21 Divide '/'
22 Identifier 'div'
23 MoreThan '>'
24 LineTerminator '\n'
25 LessThan '<'
26 Divide '/'
27 Identifier 'div'
28 MoreThan '>'
29 CloseParen ')'
30 SemiColon ';'
31 LineTerminator '\n'
32 CloseBrace '}'
33 LineTerminator '\n'
34 EOF '<EOF>'
That makes sense! The lexer can't match the regular expression pattern if there is a newline between </div> and another </div> because a newline can't be part of a regular expression:
fragment RegularExpressionChar:
~[\r\n\u2028\u2029\\/[]
| RegularExpressionBackslashSequence
| '[' RegularExpressionClassChar* ']'
One potential (but a bit hacky) fix is to add "JavaScriptLexer.LessThan" to the list of tokens in jsx/Java/JavaScriptLexerBase.java's IsRegexPossible function
/**
* Returns {@code true} if the lexer can match a regex literal.
*/
protected boolean IsRegexPossible() {
if (this.lastToken == null) {
// No token has been produced yet: at the start of the input,
// no division is possible, so a regex literal _is_ possible.
return true;
}
switch (this.lastToken.getType()) {
case JavaScriptLexer.Identifier:
case JavaScriptLexer.NullLiteral:
case JavaScriptLexer.BooleanLiteral:
case JavaScriptLexer.This:
case JavaScriptLexer.CloseBracket:
case JavaScriptLexer.CloseParen:
case JavaScriptLexer.OctalIntegerLiteral:
case JavaScriptLexer.DecimalLiteral:
case JavaScriptLexer.HexIntegerLiteral:
case JavaScriptLexer.StringLiteral:
case JavaScriptLexer.PlusPlus:
case JavaScriptLexer.MinusMinus:
case JavaScriptLexer.LessThan:
// After any of the tokens above, no regex literal can follow.
return false;
That would prevent the lexer from matching "< /regex/" fixing this bug but meaning you couldn't compare against a regex. A better solution feels like moving the regex stuff out of the lexer - but I don't really know ANTLR grammars well enough to say.
Both solutions are not ideal. For a decent JSX parser, the lexer (and parser to some extend) grammar(s) should be rewritten to account for the embedded HTML.
Would you mind expanding on what you mean by "the lexer grammar should be rewritten to account for the embedded HTML"? Thanks in advance.
Would you mind expanding on what you mean by "the lexer grammar should be rewritten to account for the embedded HTML"? Thanks in advance.
The lexer should account for HTML tokens (it must be able to create HTML tags (or parts of them), attributes inside tags, etc.). When it in its special HTML mode, it will not be able to pick up two /
s as regex literals. When looking in the lexer of the JSX grammars, I see a beginning was already made, but it was never properly finished. All in al, I think it will be quite a bit of work to properly fix things to properly parse all kind of JSX source.
Is it even possible to write a compliant Jsx parser in Antlr?
If you encounter say <span>
I believe you'd want the lexer to identify the tokens like "<", "span" (tag name), and ">". To do this you can use modes, entering a "tag" mode when it encounters the "<" and leaving when it encounters the ">", e.g.
lexer grammar JsxLexer;
LessThan : '<';
MoreThan : '>';
OpenTagStart : '<' -> pushMode(HTML_OPEN_TAG);
mode HTML_OPEN_TAG;
TagName : [a-zA-Z]+;
OpenTagStop : '>' -> popMode;
However, this won't work, because the lexer will always match the "<" for rules such as "a < b" and never the OpenTagStart rule (or vice-versa if you re-order the rules).
Or am I missing something here?
Edit: I think I've found a hack. We can do the same thing as done for regular expressions, someting like:
OpenTagStart : '<' {this.IsHtmlTagPossible()}? -> pushMode(HTML_OPEN_TAG); LessThan : '<';
Then it only matches OpenStartTag if the last token is a set of known tokens, e.g. Return, =, (, ?, :, [, Default(?), Yield(?), or {.
The <
as the start of a tag is only possible in certain cases: when the token before it is a =
or a (
(and perhaps more cases). In such cases it cannot be a "less than" operator. You should perform such a check before pushing a HTML_OPEN_TAG
. Just like the check if /
can be a regex delimiter.
Thanks very much for your help. I've created a PR that fixes this issue:
https://github.com/antlr/grammars-v4/pull/3958
I've taken your advice here and have used the lexer to create separate modes to avoid confusion with the regex and also matched the JSX opening element brace ("<") only when appropriate.
I only quickly scrolled through your PR, but at a first glance: nice work 👍