Arpeggio icon indicating copy to clipboard operation
Arpeggio copied to clipboard

performance issue when parsing => grammar not good?

Open cons0l3 opened this issue 3 years ago • 7 comments

Dear all,

first of all: nice work! keep it up.

I am currently working on an expression parser (arpeggio 1.10.1) that mimics python expressions:

from arpeggio import *
from arpeggio import RegExMatch as _

class SuppressStrMatch(StrMatch):
    suppress = True

def comment():          return [_("//.*"), _("/\*.*\*/")] 

def expression():       return Optional(disjunction), EOF

def disjunction():      return [(conjunction, OneOrMore( Kwd("OR"), conjunction ) ), conjunction]
def conjunction():      return [(inversion, OneOrMore( Kwd("AND"), inversion) ), inversion]
def inversion():        return [ (Kwd('NOT'), inversion), comparison]
def comparison():       return [ compare_op, plus]
def compare_op():       return [ binaryOp, notin, element_in, isnot, element_is ]
def binaryOp():         return plus, ['==','!=','<=','<','>=','>'], plus
def notin():            return plus, 'not', 'in', SuppressStrMatch("["), disjunction, ZeroOrMore(SuppressStrMatch(","), disjunction), SuppressStrMatch("]")
def element_in():       return plus, 'in', SuppressStrMatch("["), disjunction, ZeroOrMore(SuppressStrMatch(","), disjunction), SuppressStrMatch("]")
def isnot():            return plus, 'is', 'not', plus
def element_is():       return plus, 'is', plus
def plus():             return term, ZeroOrMore(['+', '-'], term)
def term():             return factor, ZeroOrMore(['*', '/'], factor)
def factor():           return Optional(['+','-']), [value, parenthesis]
def value():            return [literal, symbol]

def literal():          return [numericLiteral, stringLiteral, constantLiteral ]
def numericLiteral():   return _(r'\d*\.\d*|\d+') #(E(+|-)?\d+)?
def stringLiteral():    return [_(r'".*?"'), _(r"'.*?'")]
def constantLiteral():  return [_(r"True"), _(r"False"), _(r"NULL")]
def symbol():           return _(r"\w+")
def parenthesis():      return SuppressStrMatch("("), disjunction, SuppressStrMatch(")")

ExpressionParser:ParserPython = ParserPython(expression, comment, debug=False, reduce_tree=True, ignore_case=True)

when I execute a parse: expression = ExpressionParser.parse("""(p in [37,47] and (c in ["A", "P"] or l==true)) or (p in [75,76,361,390,391] and c in ["A", "P"])""")

it takes about 8 seconds. that is way to long. I would like it to be way below 1 second for parsing such a simple expression.

could anyone assist in explaing or advising how to fix the grammar? (which is directly derived from the peg parser of python itself and translated to arpeggio style)

thanks carsten

Thanks for pushing me in the right direction

cons0l3 avatar Apr 14 '21 08:04 cons0l3

oh, and I have really aweful exceptipons when the parsed string has syntax issues. it is always stack exceptions and not the nice exception "hey at position 3 after the "or" something is not right..." kind of deal.

cons0l3 avatar Apr 14 '21 08:04 cons0l3

For these kinds of grammars where there is a lot of backtracking (expressions grammars are a good example) you should turn on memoization which can yield a massive speed improvements.

Here is a test on my machine:

import time
# Without memoization
ExpressionParser = ParserPython(expression, comment, debug=False, reduce_tree=True,
                                ignore_case=True)
exp_str = """(p in [37,47] and (c in ["A", "P"] or l==true))
             or (p in [75,76,361,390,391] and c in ["A", "P"])"""
start = time.process_time()
exp_nomem = ExpressionParser.parse(exp_str)
print(exp_nomem.tree_str())
print("TIME:", time.process_time() - start, "\n")

# With memoization
ExpressionParser = ParserPython(expression, comment, debug=False, reduce_tree=True,
                                ignore_case=True, memoization=True)
start = time.process_time()
exp_mem = ExpressionParser.parse(exp_str)
print(exp_mem.tree_str())
print("TIME:", time.process_time() - start)

Output:

expression=Sequence [1-110]
  disjunction=OrderedChoice [1-108]
    conjunction=OrderedChoice [1-45]
      element_in=Sequence [1-12]
        symbol=RegExMatch(\w+) [1-2]: p
...
  EOF [110-110]: 
TIME: 9.193519810000002 

expression=Sequence [1-110]
  disjunction=OrderedChoice [1-108]
    conjunction=OrderedChoice [1-45]
      element_in=Sequence [1-12]
...
  EOF [110-110]: 
TIME: 0.004547567999999558

As you can see, without memoization it takes over 9 secs while with memoization it takes only 4 ms.

For handling exception see this part of the docs

igordejanovic avatar Apr 14 '21 09:04 igordejanovic

TLDR; for exception the easy way to get just the error message without stacktrace is to catch NoMatch exception and just print it.

igordejanovic avatar Apr 14 '21 09:04 igordejanovic

thanks igor! it seems i can also improve performance by rewriting a few rules. but memotization already improved significantly.

cons0l3 avatar Apr 14 '21 09:04 cons0l3

you are my hero of the day!

cons0l3 avatar Apr 14 '21 09:04 cons0l3

I rewrote the first few rules to be:

def disjunction():      return conjunction, ZeroOrMore( Kwd("OR"), conjunction )
def conjunction():      return inversion, ZeroOrMore( Kwd("AND"), inversion)
def inversion():        return Optional(Kwd('NOT')), comparison
def comparison():       return [ binaryCompareOp, notin, element_in, isnot, element_is, plus ]
def binaryCompareOp():  return plus, ['==','!=','<=','<','>=','>'], plus

and that alone helped considerably plus Igors tipp.

Problem solved.

cons0l3 avatar Apr 14 '21 11:04 cons0l3

I'm glad it solved your problem. I'll leave the issue open labeled with question to be easier to find.

igordejanovic avatar Apr 14 '21 12:04 igordejanovic