grammars-v4 icon indicating copy to clipboard operation
grammars-v4 copied to clipboard

JSX parser issue with nested elements

Open aleppard opened this issue 1 year ago • 10 comments

Hello,

Thanks for the great project! Thanks @mike-lischke for your work on the JSX parser.

I'm trying to integrate the JSX parser and I think I've found an issue when using nested elements/components on a single line. The following works fine:

function Function(props) {
    return (<div></div>);
}

but if I try to parse the following:

function Function(props) {
   return (<div><div></div></div>);
}

I get the following errors:

line 2:39 no viable alternative at input 'function Function(props) {\n    return (<div><div>hello</div></div>)'
line 1:25 no viable alternative at input '{'
line 2:11 no viable alternative at input '('
line 2:39 mismatched input ')' expecting {'[', '(', '{', '++', '--', '+', '-', '~', '!', '<', 'null', BooleanLiteral, DecimalLiteral, HexIntegerLiteral, OctalIntegerLiteral, OctalIntegerLiteral2, BinaryIntegerLiteral, BigHexIntegerLiteral, BigOctalIntegerLiteral, BigBinaryIntegerLiteral, BigDecimalIntegerLiteral, 'typeof', 'new', 'void', 'function', 'this', 'delete', 'class', 'super', 'import', 'async', 'await', NonStrictLet, 'yield', Identifier, StringLiteral, BackTick, RegularExpressionLiteral}

I think the problem is that it requires a newline character after elements. The following works:

function Function(props) {
   return (<div>
                 <div>
                 </div>
           </div>);
}

Any idea how we can modify the parser to handle the case where there is no newline? Thanks.

aleppard avatar Jan 12 '24 03:01 aleppard

When printing the tokens that are being created for the input

function Function(props) {
   return (<div><div></div></div>);
}

I get this:

  1    Function_                      'function'
  2    WhiteSpaces                    ' '
  3    Identifier                     'Function'
  4    OpenParen                      '('
  5    Identifier                     'props'
  6    CloseParen                     ')'
  7    WhiteSpaces                    ' '
  8    OpenBrace                      '{'
  9    LineTerminator                 '\n'
  10   WhiteSpaces                    '   '
  11   Return                         'return'
  12   WhiteSpaces                    ' '
  13   OpenParen                      '('
  14   LessThan                       '<'
  15   Identifier                     'div'
  16   MoreThan                       '>'
  17   LessThan                       '<'
  18   Identifier                     'div'
  19   MoreThan                       '>'
  20   LessThan                       '<'
  21   RegularExpressionLiteral       '/div></div'
  22   MoreThan                       '>'
  23   CloseParen                     ')'
  24   SemiColon                      ';'
  25   LineTerminator                 '\n'
  26   CloseBrace                     '}'
  27   LineTerminator                 '\n'
  28   EOF                            '<EOF>'

So the issue is that the lexer tokenizes the input /div></div as a RegularExpressionLiteral. And when there is a line break between the /s, it won't become a RegularExpressionLiteral:

  1    Function_                      'function'
  2    WhiteSpaces                    ' '
  3    Identifier                     'Function'
  4    OpenParen                      '('
  5    Identifier                     'props'
  6    CloseParen                     ')'
  7    WhiteSpaces                    ' '
  8    OpenBrace                      '{'
  9    LineTerminator                 '\n'
  10   WhiteSpaces                    '   '
  11   Return                         'return'
  12   WhiteSpaces                    ' '
  13   OpenParen                      '('
  14   LessThan                       '<'
  15   Identifier                     'div'
  16   MoreThan                       '>'
  17   LessThan                       '<'
  18   Identifier                     'div'
  19   MoreThan                       '>'
  20   LessThan                       '<'
  21   Divide                         '/'
  22   Identifier                     'div'
  23   MoreThan                       '>'
  24   LineTerminator                 '\n'
  25   LessThan                       '<'
  26   Divide                         '/'
  27   Identifier                     'div'
  28   MoreThan                       '>'
  29   CloseParen                     ')'
  30   SemiColon                      ';'
  31   LineTerminator                 '\n'
  32   CloseBrace                     '}'
  33   LineTerminator                 '\n'
  34   EOF                            '<EOF>'

bkiers avatar Jan 14 '24 11:01 bkiers

That makes sense! The lexer can't match the regular expression pattern if there is a newline between </div> and another </div> because a newline can't be part of a regular expression:

fragment RegularExpressionChar:
    ~[\r\n\u2028\u2029\\/[]
    | RegularExpressionBackslashSequence
    | '[' RegularExpressionClassChar* ']'

aleppard avatar Jan 15 '24 04:01 aleppard

One potential (but a bit hacky) fix is to add "JavaScriptLexer.LessThan" to the list of tokens in jsx/Java/JavaScriptLexerBase.java's IsRegexPossible function

 /**
     * Returns {@code true} if the lexer can match a regex literal.
     */
    protected boolean IsRegexPossible() {
                                       
        if (this.lastToken == null) {
            // No token has been produced yet: at the start of the input,
            // no division is possible, so a regex literal _is_ possible.
            return true;
        }
        
        switch (this.lastToken.getType()) {
            case JavaScriptLexer.Identifier:
            case JavaScriptLexer.NullLiteral:
            case JavaScriptLexer.BooleanLiteral:
            case JavaScriptLexer.This:
            case JavaScriptLexer.CloseBracket:
            case JavaScriptLexer.CloseParen:
            case JavaScriptLexer.OctalIntegerLiteral:
            case JavaScriptLexer.DecimalLiteral:
            case JavaScriptLexer.HexIntegerLiteral:
            case JavaScriptLexer.StringLiteral:
            case JavaScriptLexer.PlusPlus:
            case JavaScriptLexer.MinusMinus:
            case JavaScriptLexer.LessThan:
                // After any of the tokens above, no regex literal can follow.
                return false;

That would prevent the lexer from matching "< /regex/" fixing this bug but meaning you couldn't compare against a regex. A better solution feels like moving the regex stuff out of the lexer - but I don't really know ANTLR grammars well enough to say.

aleppard avatar Jan 15 '24 05:01 aleppard

Both solutions are not ideal. For a decent JSX parser, the lexer (and parser to some extend) grammar(s) should be rewritten to account for the embedded HTML.

bkiers avatar Jan 15 '24 07:01 bkiers

Would you mind expanding on what you mean by "the lexer grammar should be rewritten to account for the embedded HTML"? Thanks in advance.

aleppard avatar Jan 16 '24 00:01 aleppard

Would you mind expanding on what you mean by "the lexer grammar should be rewritten to account for the embedded HTML"? Thanks in advance.

The lexer should account for HTML tokens (it must be able to create HTML tags (or parts of them), attributes inside tags, etc.). When it in its special HTML mode, it will not be able to pick up two /s as regex literals. When looking in the lexer of the JSX grammars, I see a beginning was already made, but it was never properly finished. All in al, I think it will be quite a bit of work to properly fix things to properly parse all kind of JSX source.

bkiers avatar Jan 16 '24 06:01 bkiers

Is it even possible to write a compliant Jsx parser in Antlr?

If you encounter say <span> I believe you'd want the lexer to identify the tokens like "<", "span" (tag name), and ">". To do this you can use modes, entering a "tag" mode when it encounters the "<" and leaving when it encounters the ">", e.g.

lexer grammar JsxLexer;

LessThan                   : '<';
MoreThan                   : '>';

OpenTagStart  : '<' -> pushMode(HTML_OPEN_TAG);

mode HTML_OPEN_TAG;

TagName : [a-zA-Z]+;
OpenTagStop : '>' -> popMode;

However, this won't work, because the lexer will always match the "<" for rules such as "a < b" and never the OpenTagStart rule (or vice-versa if you re-order the rules).

Or am I missing something here?

Edit: I think I've found a hack. We can do the same thing as done for regular expressions, someting like:

OpenTagStart : '<' {this.IsHtmlTagPossible()}? -> pushMode(HTML_OPEN_TAG); LessThan : '<';

Then it only matches OpenStartTag if the last token is a set of known tokens, e.g. Return, =, (, ?, :, [, Default(?), Yield(?), or {.

aleppard avatar Feb 03 '24 03:02 aleppard

The < as the start of a tag is only possible in certain cases: when the token before it is a = or a ( (and perhaps more cases). In such cases it cannot be a "less than" operator. You should perform such a check before pushing a HTML_OPEN_TAG. Just like the check if / can be a regex delimiter.

bkiers avatar Feb 03 '24 08:02 bkiers

Thanks very much for your help. I've created a PR that fixes this issue:

https://github.com/antlr/grammars-v4/pull/3958

I've taken your advice here and have used the lexer to create separate modes to avoid confusion with the regex and also matched the JSX opening element brace ("<") only when appropriate.

aleppard avatar Feb 06 '24 02:02 aleppard

I only quickly scrolled through your PR, but at a first glance: nice work 👍

bkiers avatar Feb 06 '24 07:02 bkiers