nearley Very slow performance for lua-like grammer

Firstly - nearley is awesome 👍🏽, I was able to create a working grammer in no time at all! The biggest problem so far has been performance.

I'm parsing a lua-like grammer - it's almost identical to the example one, but I added a few extra features, and I'm using moo as my lexer.

It's taking around 26 seconds to parse the input [. - which is an invalid grammer. I'm taking a user's input, so sometimes they enter invalid stuff like that. Is there a way to speed this up, or how would I got about debugging this performance issue?

real	0m26.580s
user	0m18.580s
sys	0m1.570s

This is the output of the debug:

Error: Syntax error at line 1 col 2:

  [.
   ^
Unexpected dot token: ".". Instead, I was expecting to see one of the following:

A rbracket token based on:
    Array → %lbracket _ ● %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A ws token based on:
    _ → _ ● %ws
    Array → %lbracket ● _ ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A not token based on:
    ExpUnary →  ● %not __ ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A bang token based on:
    ExpUnary →  ● %bang ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A minus token based on:
    ExpUnary →  ● %minus _ ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A string token based on:
    Atom →  ● %string
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A boolean token based on:
    Atom →  ● %boolean
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A number token based on:
    Atom →  ● %number
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A lbracket token based on:
    Array →  ● %lbracket _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A lbracket token based on:
    Array →  ● %lbracket _ ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A lcurly token based on:
    Object →  ● %lcurly _ %rcurly
    Atom →  ● Object
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A lcurly token based on:
    Object →  ● %lcurly _ KeyValueList _ %rcurly
    Atom →  ● Object
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A lparen token based on:
    Parenthesized →  ● %lparen _ Exp _ %rparen
    Atom →  ● Parenthesized
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A name token based on:
    ExpName →  ● %name
    Var →  ● ExpName
    PrefixExp →  ● Var
    Atom →  ● PrefixExp
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A reference token based on:
    Reference →  ● %reference
    ExpName →  ● Reference
    Var →  ● ExpName
    PrefixExp →  ● Var
    Atom →  ● PrefixExp
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
A reference token based on:
    Reference →  ● %reference _ ReferenceSelectorList
    ExpName →  ● Reference
    Var →  ● ExpName
    PrefixExp →  ● Var
    Atom →  ● PrefixExp
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr
    ExpList →  ● Exp
    Array → %lbracket _ ● ExpList _ %rbracket
    Atom →  ● Array
    ExpPow →  ● Atom
    ExpUnary →  ● ExpPow
    ExpProduct →  ● ExpUnary
    ExpSum →  ● ExpProduct
    ExpConcatenation →  ● ExpSum _ %concat _ ExpConcatenation
    ExpComparison →  ● ExpConcatenation
    ExpAnd →  ● ExpComparison
    ExpOr →  ● ExpAnd
    Exp →  ● ExpOr

Oct 18 '19 17:10 calummoore

So I've found the part of the code that was taking a long time - it's actually spending almost all of the time creating the error "state stack". If I comment out the code (as below), the timing improves to 0.125s (from ~26 seconds) - not a bad improvement!

I actually only care about the error position (which I still get when commenting out the below). Maybe we can add an option to skip this extra error information? - as it takes a very long time to build.

real	0m0.125s
user	0m0.088s
sys	0m0.031s

Parser.prototype.reportError = function(token) {
        var lines = [];
        var tokenDisplay = (token.type ? token.type + " token: " : "") + JSON.stringify(token.value !== undefined ? token.value : token);
        lines.push(this.lexer.formatError(token, "Syntax error"));
        lines.push('Unexpected ' + tokenDisplay + '. Instead, I was expecting to see one of the following:\n');
        var lastColumnIndex = this.table.length - 2;
        var lastColumn = this.table[lastColumnIndex];
        var expectantStates = lastColumn.states
            .filter(function(state) {
                var nextSymbol = state.rule.symbols[state.dot];
                return nextSymbol && typeof nextSymbol !== "string";
            });
        
        // // Display a "state stack" for each expectant state
        // // - which shows you how this state came to be, step by step. 
        // // If there is more than one derivation, we only display the first one.
        // var stateStacks = expectantStates
        //     .map(function(state) {
        //         var stacks = this.buildStateStacks(state, []);
        //         return stacks[0];
        //     }, this);
        // // Display each state that is expecting a terminal symbol next.
        // stateStacks.forEach(function(stateStack) {
        //     var state = stateStack[0];
        //     var nextSymbol = state.rule.symbols[state.dot];
        //     var symbolDisplay = this.getSymbolDisplay(nextSymbol);
        //     lines.push('A ' + symbolDisplay + ' based on:');
        //     this.displayStateStack(stateStack, lines);
        // }, this);
            
        lines.push("");
        return lines.join("\n");
    };

Oct 18 '19 20:10 calummoore

Hmm interesting, it appears the fancy error reporting introduced some kind of performance regression? @airportyh do you know what's wrong?

Oct 19 '19 04:10 kach

Wow, that's definitely not intentional. @calummoore would you mind giving me the grammar you are using that exhibited this performance penalty? I'd like to look at it.

Oct 28 '19 19:10 airportyh

I am seeing a very similar issue where reportError takes 20 seconds to build the list of expected tokens, but for a different, probably much simpler grammar.

I am using the following grammar:

@lexer lexer

exp -> boolexp
boolexp ->
    boolexp %And comp
  | boolexp %Or comp
  | comp
comp ->
    comp %Equal sum
  | comp %NotEqual sum
  | comp %LessThan sum
  | comp %LessThanEqual sum
  | comp %GreaterThan sum
  | comp %GreaterThanEqual sum
  | sum
sum ->
    sum %Plus product
  | sum %Minus product
  | product
product ->
    product %Star unaryexp
  | product %Slash unaryexp
  | unaryexp
unaryexp ->
    %Minus unaryexp
  | %Not unaryexp
  | atomicexp
atomicexp ->
    intconst
  # TODO: variables
  # TODO: call
  | %LBrace boolexp %RBrace

intconst -> %IntegerConst

As I am using a custom lexer, I guess you can't directly use it, but I hope the overall structure will be sufficient to debug this issue.

Furthermore, here is a perftrace from the Chrome DevTools slowtrace and the raw JSON report here which can be opened by Chrome here: slow-reportError.zip

From here on, I am mostly guessing as I am not familiar with Nearley or the error reporting algorithm. To me, this looks like an issue with too much recursion/too large fanout. buildParseStack calls itself recursively with a maximum call depth of ~15. The way I wrote my rules, I am probably introducing a large fanout on every recursion level. E.g., assuming that the error position is before the first token, I guess that

    comp %Equal sum
  | comp %NotEqual sum
  | comp %LessThan sum
  | comp %LessThanEqual sum
  | comp %GreaterThan sum
  | comp %GreaterThanEqual sum
  | sum

will translate into 7 rules and buildStacks calls itself recursively for each one of them, even though the first 6 rules would all result in the same valid tokens. In the error message, this duplication is not present anymore, so I guess a later step removes duplicated rules again.

FYI: The error message in my case is

A Minus token based on:
    unaryexp →  ● %Minus unaryexp
    product →  ● unaryexp
    sum →  ● product
    comp →  ● sum
    boolexp →  ● comp
    exp →  ● boolexp
A Not token based on:
    unaryexp →  ● %Not unaryexp
    product →  ● unaryexp
    sum →  ● product
    comp →  ● sum
    boolexp →  ● comp
    exp →  ● boolexp
A LBrace token based on:
    atomicexp →  ● %LBrace boolexp %RBrace
    unaryexp →  ● atomicexp
    product →  ● unaryexp
    sum →  ● product
    comp →  ● sum
    boolexp →  ● comp
    exp →  ● boolexp
A IntegerConst token based on:
    intconst →  ● %IntegerConst
    atomicexp →  ● intconst
    unaryexp →  ● atomicexp
    product →  ● unaryexp
    sum →  ● product
    comp →  ● sum
    boolexp →  ● comp
    exp →  ● boolexp
A FloatConst token based on:
    floatconst →  ● %FloatConst
    atomicexp →  ● floatconst
    unaryexp →  ● atomicexp
    product →  ● unaryexp
    sum →  ● product
    comp →  ● sum
    boolexp →  ● comp
    exp →  ● boolexp
A StringConst token based on:
    stringconst →  ● %StringConst
    atomicexp →  ● stringconst
    unaryexp →  ● atomicexp

Dec 03 '19 23:12 vogelsgesang

After slightly refactoring my grammar, buildStateStacks now executes in less than 1 millisecond. Restructuring the grammar in the following way works around the issues

@lexer lexer

exp -> boolexp
boolexp ->
    boolexp boolop comp
  | comp
comp ->
    comp compop sum
  | sum
sum ->
    sum sumop product
  | product
product ->
    product productop unaryexp
  | unaryexp
unaryexp ->
    unaryop unaryexp
  | atomicexp
atomicexp ->
    intconst
  | floatconst
  | stringconst
  | %LBrace boolexp %RBrace

boolop -> %And | %Or
compop -> %Equal | %NotEqual | %LessThan | %LessThanEqual | %GreaterThan | %GreaterThanEqual
sumop -> %Plus | %Minus
productop -> %Star | %Slash
unaryop -> %Minus %Not

intconst -> %IntegerConst
floatconst -> %FloatConst
stringconst -> %StringConst

Dec 03 '19 23:12 vogelsgesang

nearley nearley copied to clipboard

Very slow performance for lua-like grammer

nearley
nearley copied to clipboard