lark
lark copied to clipboard
I managed to create a non repeatable parser
It seems I have some real bad problems with my tentative implementation. Launching the program several times I end up with (at least) two different outputs, even if I'm parsing from a static string.
Another thing I wasn't able to understand is how to implement keywords; since keywords also match identifier syntax trying to mix the two results in a never-terminating parser.
My current code follows.
#!/usr/bin/env python3
from lark import Lark, Transformer
test = """\
@startuml
NotShooting : "La Befana vien di notte..."
[*] --> NotShooting
state NotShooting {
[*] --> Idle : :F: loopfunc
Idle --> Configuring : :Ev:33 "EvConfig" :Fi: start_config :Fo:end_config
Configuring --> Idle : :Ev:0xff "EvConfig"
}
state Configuring {
[*] --> NewValueSelection
NewValueSelection --> NewValuePreview : "EvNewValue"
NewValuePreview --> NewValueSelection : "EvNewValueRejected"
NewValuePreview --> NewValueSelection : "EvNewValueSaved"
/'
this is a multi-line comment
which will get ignored
'/
state NewValuePreview {
State1 -> State2
}
}
@enduml
"""
parser = Lark('''\
IDENT : /[_a-zA-Z][_a-zA-Z0-9]*/
USTART : "@startuml" NEWLINE
UEND : "@enduml" NEWLINE
START : "[*]"
STRANS : "-->"
HTRANS : "->"
STATE : "state"
WHATEVER: /[^\\r\\n]+/
diagram : USTART diag UEND
diag : (cstate | sstate | trans)+
sstate : IDENT [ ":" (sdef | STRING)+ ] NEWLINE
cstate : STATE IDENT "{" NEWLINE diag "}" NEWLINE
trans : (IDENT | START) (STRANS | HTRANS) IDENT [ ":" (tdef | sdef | STRING)+ ] NEWLINE
sdef : fidef | fodef
fidef : ":Fi:" _func
fodef : ":Fo:" _func
tdef : evdef | f_def
evdef : ":Ev:" _event
f_def : ":F:" _func
_event : number
_func : IDENT
number : DEC_NUMBER | HEX_NUMBER | OCT_NUMBER
DEC_NUMBER: /[1-9]\d*l?/i
HEX_NUMBER: /0x[\da-f]*l?/i
OCT_NUMBER: /0o?[0-7]*l?/i
COMMENT : "/'" /(.|\\r|\\n)+/ "'/" [NEWLINE]
| "'" WHATEVER NEWLINE
STRING : "\\"" ("\\\""|/[^"]/)* "\\""
%ignore COMMENT
%import common.WS_INLINE
%ignore WS_INLINE
%import common.NEWLINE
''', start='diagram')
ast = parser.parse(test)
class State(object):
states = {}
def __init__(self, name, parent=None):
self.name = name
self.parent = parent
self.transitions = list()
self.infunc = None
self.outfunc = None
State.states[name] = self
@staticmethod
def get(name):
try:
return State.states[name]
except KeyError:
s = State(name)
return s
def add_trans(self, trn):
self.transitions.append(trn)
class Transition(object):
def __init__(self, src, dst, evt=None, rtn=None):
self.source = src
self.destination = dst
self.event = evt
self.routine = rtn
sstate = State.get(src)
# dstate = State.get(dst)
sstate.add_trans(self)
class MyT(Transformer):
def __init__(self):
super(MyT, self).__init__()
self.states = {}
self.f = None
self.fi = None
self.fo = None
self.ev = None
def diagram(self, matches):
d = matches[1]
return d
def diag(self, matches):
d = self.states
self.states = {}
return d
def sdef(self, matches):
print('sdef : ', matches[0])
return matches[0]
def tdef(self, matches):
print('tdef : ', matches[0])
return matches[0]
def evdef(self, matches):
t = matches[0]
print("evdef: %s" % t)
self.ev = t
return t
def number(self, matches):
t = matches[0]
n = int(t, 0)
print("number: %d" % n)
self.ev = n
return n
def f_def(self, matches):
t = matches[0]
print("f_def: '%s'" % t)
self.f = t
return t
def fidef(self, matches):
t = matches[0]
print("fidef: '%s'" % t)
self.fi = t
return t
def fodef(self, matches):
t = matches[0]
print("fodef: '%s'" % t)
self.fo = t
return t
def sstate(self, matches):
nam = matches[0]
s = State.get(nam)
if self.fi is not None:
s.infunc = self.fi
self.fi = None
if self.fo is not None:
s.outfunc = self.fo
self.fo = None
self.states[nam] = s
return s
def cstate(self, matches):
d = matches[3]
nam = matches[1]
s = State.get(nam)
for t in d.values():
t.parent = s
if self.fi is not None:
s.infunc = self.fi
self.fi = None
if self.fo is not None:
s.outfunc = self.fo
self.fo = None
return matches
def trans(self, matches):
src = matches[0]
dst = matches[2]
t = Transition(src, dst, evt=self.ev, rtn=self.f)
self.ev = None
self.f = None
return t
bst = MyT().transform(ast)
Hi, what are the two outputs? I ran it a dozen times and always got the same result:
f_def: 'loopfunc'
tdef : loopfunc
number: 33
evdef: 33
tdef : 33
Regarding keywords, I'm not sure what you mean by a "never-terminating parser", but lark automatically resolves collisions between strings internally. (in some situations, collisions between regexps cannot be resolved, but I don't think that's what you're refering to)
at first try (loading code from the previous message):
/usr/bin/python3.5 /home/mcon/projects/lark/prova.py
f_def: 'loopfunc'
tdef : loopfunc
number: 33
evdef: 33
tdef : 33
fidef: 'start_config'
sdef : start_config
fodef: 'end_config'
sdef : end_config
number: 255
evdef: 255
tdef : 255
Process finished with exit code 0
Three times the same and then:
/usr/bin/python3.5 /home/mcon/projects/lark/prova.py
f_def: 'loopfunc'
tdef : loopfunc
number: 33
evdef: 33
tdef : 33
Process finished with exit code 0
The first one is (more or less) the correct one.
About the "never-terminating parser": it turns out it actually terminates (in error) after several minutes computation (on that small input!). I'm investigating about what's wrong on my grammar...
Well, I would guess that the
WHATEVER: /[^\\r\\n]+/
Terminal is pretty challenging, since it means it has to try every variation of ANY TEXT.
@erezsh Is there any recommendations on how to structure regexes to avoid inconsistent parsing? I've also run into a case where my tests will be "flaky" because they will fail every so often due to the regex
MOT: (/[a-zA-Z]+/i)+
TOKEN: (/[\w|\.|\:|\/]+/i)
will sometimes conflict and instead of outputting ASK
will output A
and SK
. is there a. way to ensure that the match will complete the word as a whole?
I am using the earley
parser
PS: I've seen in a issue #108 that we can use ~1..99 but this is not documented, can you comment on it's use? and how would you like to receive contributions? (Documentation wise)
Hi @sjolicoeur,
- The Earley parser should be consistent. If you have a grammar that parses differently each time and you can share it, please do.
- The best way to avoid that situation is by making the terminals unambiguous. That usually applies in two ways. The relevant one in this case, I think, is making it so that ["A", "SK"] won't be a correct interpretation. Which means avoiding structures like
TOKEN+
. - Yes, the range operator is new, so I didn't get the chance to document it yet. I'll be very happy to receive documentation contributions.
Feel free to open a new issue, or to move the discussion to the newsgroup.
Hi @erezsh,
Thanks for the quick reply! concerning point 2. it is a regex, so the behavior I would expect would be for it to match till the end of the word, I even tried to add \W
in the hope it would also include the space.
That said, I tried add thing lexer="standard"
and it seems it. breaks my current grammar in many ways, I'll try to fix the errors and see if my tests pass
@erezsh I posted a test app that is close to my domain and presents some of the issues i encountered https://groups.google.com/forum/m/#!topic/lark-parser/_win328S65s
Yes, I can see the bug. Thanks for providing some code.
I've narrowed it down to this example:
from lark import Lark
grammar = """
start: (","|rest)+
rest: WORD+
WORD: /\w+(?=[ ,])/
%ignore " "
"""
parser = Lark(grammar)
tree = parser.parse("1code2 more words more and more 1code2 , bag of words key ")
print( tree.pretty() )
I'll try to figure it out when I have the time.