lark icon indicating copy to clipboard operation
lark copied to clipboard

I managed to create a non repeatable parser

Open mcondarelli opened this issue 6 years ago • 8 comments

It seems I have some real bad problems with my tentative implementation. Launching the program several times I end up with (at least) two different outputs, even if I'm parsing from a static string.

Another thing I wasn't able to understand is how to implement keywords; since keywords also match identifier syntax trying to mix the two results in a never-terminating parser.

My current code follows.

#!/usr/bin/env python3

from lark import Lark, Transformer

test = """\
@startuml
NotShooting : "La Befana vien di notte..."
[*] --> NotShooting

state NotShooting {
  [*] --> Idle : :F: loopfunc
  Idle --> Configuring : :Ev:33 "EvConfig" :Fi: start_config :Fo:end_config 
  Configuring --> Idle : :Ev:0xff "EvConfig" 
}

state Configuring {
  [*] --> NewValueSelection
  NewValueSelection --> NewValuePreview : "EvNewValue"
  NewValuePreview --> NewValueSelection : "EvNewValueRejected"
  NewValuePreview --> NewValueSelection : "EvNewValueSaved"

  /'
    this is a multi-line comment
    which will get ignored
  '/
  state NewValuePreview {
     State1 -> State2
  }

}
@enduml
"""

parser = Lark('''\
    IDENT   : /[_a-zA-Z][_a-zA-Z0-9]*/
    USTART  : "@startuml" NEWLINE
    UEND    : "@enduml" NEWLINE
    START   : "[*]"
    STRANS  : "-->"
    HTRANS  : "->"
    STATE   : "state"
    WHATEVER: /[^\\r\\n]+/
    
    diagram : USTART diag UEND
    
    diag    : (cstate | sstate | trans)+
    
    sstate  : IDENT [ ":" (sdef | STRING)+ ] NEWLINE
    cstate  : STATE IDENT "{" NEWLINE diag "}" NEWLINE
    
    trans   : (IDENT | START) (STRANS | HTRANS) IDENT [ ":" (tdef | sdef | STRING)+ ] NEWLINE
    
    sdef    : fidef | fodef
    fidef  : ":Fi:" _func
    fodef  : ":Fo:" _func
    
    tdef    : evdef | f_def
    evdef  : ":Ev:" _event
    f_def  : ":F:" _func

    _event  : number 
    _func   : IDENT
    
    number  : DEC_NUMBER | HEX_NUMBER | OCT_NUMBER
    DEC_NUMBER: /[1-9]\d*l?/i
    HEX_NUMBER: /0x[\da-f]*l?/i
    OCT_NUMBER: /0o?[0-7]*l?/i

    COMMENT : "/'" /(.|\\r|\\n)+/ "'/" [NEWLINE]
            | "'" WHATEVER NEWLINE
    STRING  : "\\"" ("\\\""|/[^"]/)* "\\""
    %ignore COMMENT
    %import common.WS_INLINE
    %ignore WS_INLINE
    %import common.NEWLINE
    
''', start='diagram')

ast = parser.parse(test)


class State(object):
    states = {}

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.transitions = list()
        self.infunc = None
        self.outfunc = None
        State.states[name] = self

    @staticmethod
    def get(name):
        try:
            return State.states[name]
        except KeyError:
            s = State(name)
            return s

    def add_trans(self, trn):
        self.transitions.append(trn)


class Transition(object):
    def __init__(self, src, dst, evt=None, rtn=None):
        self.source = src
        self.destination = dst
        self.event = evt
        self.routine = rtn
        sstate = State.get(src)
        # dstate = State.get(dst)
        sstate.add_trans(self)


class MyT(Transformer):
    def __init__(self):
        super(MyT, self).__init__()
        self.states = {}
        self.f = None
        self.fi = None
        self.fo = None
        self.ev = None

    def diagram(self, matches):
        d = matches[1]
        return d

    def diag(self, matches):
        d = self.states
        self.states = {}
        return d

    def sdef(self, matches):
        print('sdef : ', matches[0])
        return matches[0]

    def tdef(self, matches):
        print('tdef : ', matches[0])
        return matches[0]

    def evdef(self, matches):
        t = matches[0]
        print("evdef: %s" % t)
        self.ev = t
        return t

    def number(self, matches):
        t = matches[0]
        n = int(t, 0)
        print("number: %d" % n)
        self.ev = n
        return n

    def f_def(self, matches):
        t = matches[0]
        print("f_def: '%s'" % t)
        self.f = t
        return t

    def fidef(self, matches):
        t = matches[0]
        print("fidef: '%s'" % t)
        self.fi = t
        return t

    def fodef(self, matches):
        t = matches[0]
        print("fodef: '%s'" % t)
        self.fo = t
        return t

    def sstate(self, matches):
        nam = matches[0]
        s = State.get(nam)
        if self.fi is not None:
            s.infunc = self.fi
            self.fi = None
        if self.fo is not None:
            s.outfunc = self.fo
            self.fo = None
        self.states[nam] = s
        return s

    def cstate(self, matches):
        d = matches[3]
        nam = matches[1]
        s = State.get(nam)
        for t in d.values():
            t.parent = s
        if self.fi is not None:
            s.infunc = self.fi
            self.fi = None
        if self.fo is not None:
            s.outfunc = self.fo
            self.fo = None
        return matches

    def trans(self, matches):
        src = matches[0]
        dst = matches[2]
        t = Transition(src, dst, evt=self.ev, rtn=self.f)
        self.ev = None
        self.f = None
        return t


bst = MyT().transform(ast)

mcondarelli avatar Feb 06 '18 19:02 mcondarelli

Hi, what are the two outputs? I ran it a dozen times and always got the same result:

f_def: 'loopfunc'
tdef :  loopfunc
number: 33
evdef: 33
tdef :  33

Regarding keywords, I'm not sure what you mean by a "never-terminating parser", but lark automatically resolves collisions between strings internally. (in some situations, collisions between regexps cannot be resolved, but I don't think that's what you're refering to)

erezsh avatar Feb 08 '18 16:02 erezsh

at first try (loading code from the previous message):

/usr/bin/python3.5 /home/mcon/projects/lark/prova.py
f_def: 'loopfunc'
tdef :  loopfunc
number: 33
evdef: 33
tdef :  33
fidef: 'start_config'
sdef :  start_config
fodef: 'end_config'
sdef :  end_config
number: 255
evdef: 255
tdef :  255

Process finished with exit code 0

Three times the same and then:

/usr/bin/python3.5 /home/mcon/projects/lark/prova.py
f_def: 'loopfunc'
tdef :  loopfunc
number: 33
evdef: 33
tdef :  33

Process finished with exit code 0

The first one is (more or less) the correct one.

About the "never-terminating parser": it turns out it actually terminates (in error) after several minutes computation (on that small input!). I'm investigating about what's wrong on my grammar...

mcondarelli avatar Feb 09 '18 09:02 mcondarelli

Well, I would guess that the

WHATEVER: /[^\\r\\n]+/

Terminal is pretty challenging, since it means it has to try every variation of ANY TEXT.

erezsh avatar Feb 09 '18 10:02 erezsh

@erezsh Is there any recommendations on how to structure regexes to avoid inconsistent parsing? I've also run into a case where my tests will be "flaky" because they will fail every so often due to the regex

MOT: (/[a-zA-Z]+/i)+
TOKEN: (/[\w|\.|\:|\/]+/i)

will sometimes conflict and instead of outputting ASK will output A and SK. is there a. way to ensure that the match will complete the word as a whole?

I am using the earley parser

PS: I've seen in a issue #108 that we can use ~1..99 but this is not documented, can you comment on it's use? and how would you like to receive contributions? (Documentation wise)

sjolicoeur avatar Apr 26 '18 19:04 sjolicoeur

Hi @sjolicoeur,

  1. The Earley parser should be consistent. If you have a grammar that parses differently each time and you can share it, please do.
  2. The best way to avoid that situation is by making the terminals unambiguous. That usually applies in two ways. The relevant one in this case, I think, is making it so that ["A", "SK"] won't be a correct interpretation. Which means avoiding structures like TOKEN+.
  3. Yes, the range operator is new, so I didn't get the chance to document it yet. I'll be very happy to receive documentation contributions.

Feel free to open a new issue, or to move the discussion to the newsgroup.

erezsh avatar Apr 27 '18 08:04 erezsh

Hi @erezsh,

Thanks for the quick reply! concerning point 2. it is a regex, so the behavior I would expect would be for it to match till the end of the word, I even tried to add \W in the hope it would also include the space.

That said, I tried add thing lexer="standard" and it seems it. breaks my current grammar in many ways, I'll try to fix the errors and see if my tests pass

sjolicoeur avatar Apr 28 '18 05:04 sjolicoeur

@erezsh I posted a test app that is close to my domain and presents some of the issues i encountered https://groups.google.com/forum/m/#!topic/lark-parser/_win328S65s

sjolicoeur avatar Apr 30 '18 23:04 sjolicoeur

Yes, I can see the bug. Thanks for providing some code.

I've narrowed it down to this example:

    from lark import Lark

    grammar = """
        start: (","|rest)+
        rest: WORD+
        WORD: /\w+(?=[ ,])/
        %ignore " "
    """

    parser = Lark(grammar)
    tree = parser.parse("1code2 more words more and more 1code2 , bag of words key ")
    print( tree.pretty() )

I'll try to figure it out when I have the time.

erezsh avatar May 01 '18 08:05 erezsh