Earley parser produces wrong parse tree
Over on SO someone asked this question: https://stackoverflow.com/questions/76366280/parsing-formulas-using-lark-ebnf/76381256
As far as I can tell, it shows a bug in the Earley parser where it duplicates some tokens in the presence of a large number of ambiguities. I don't have the expertise with Earley to figure out what is happening, or the time to create a minimal example right now.
I have it down to:
grammar = """
start: (a+)*
!a.1: "A" |
"""
l = Lark(grammar)
tree = l.parse("A")
print(tree.pretty())
Output:
```
start
  a
  a   A
  a
  a   A
```
I suspect that the issue is somewhere in `ForestToParseTree` with `resolve_ambiguity=True`. I do not see the same problem in the tree returned when using `ambiguity="explicit"`. I'll look into it more when I have a chance.
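For anyone who wants to compare the two modes side by side, here is a small sketch using the grammar from above (`ambiguity="explicit"` is a standard Lark option; the comments describe the behavior reported in this thread):

```python
from lark import Lark

grammar = """
start: (a+)*
!a.1: "A" |
"""

# Default is ambiguity="resolve" -- this is where the duplicated tokens appear
print(Lark(grammar).parse("A").pretty())

# With ambiguity="explicit", the alternatives are exposed as _ambig nodes
# instead, and the duplication does not show up
print(Lark(grammar, ambiguity="explicit").parse("A").pretty())
```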
The issue is that when `ambiguity="resolve"`, the `ParseTreeBuilder` will use the `ChildFilterLALR` filters. However, the `ForestToParseTree` cache breaks the `ChildFilterLALR` assumption that it is safe to change the parse tree. We have two options:
- Disable the `ForestToParseTree` cache when `ambiguity="resolve"`.
- Always use the regular `ChildFilter` for Earley.
Either will work, but they may have different performance implications.
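For context, here is a rough sketch of why the combination breaks (simplified stand-ins, not lark's actual classes): the LALR-style filter rewrites the children list it is given in place, which is only safe when every subtree has a single parent, while the regular filter builds a new list.

```python
# Simplified stand-ins for the two filter styles (not lark's real code).

class ChildFilter:
    """Safe variant: builds a new children list, leaving the input untouched."""
    def __init__(self, to_include):
        self.to_include = to_include  # indices of the children to keep

    def __call__(self, children):
        return [children[i] for i in self.to_include]


class ChildFilterLALR:
    """Fast variant: filters in place. Fine for LALR, where every subtree has
    exactly one parent; unsafe once a cache shares subtrees between parents,
    as the ForestToParseTree cache does."""
    def __init__(self, to_include):
        self.to_include = to_include

    def __call__(self, children):
        children[:] = [children[i] for i in self.to_include]
        return children


# With a shared (cached) list, only the copying filter is safe:
shared = ["a", "b", "c"]
print(ChildFilter([0, 1])(shared))      # ['a', 'b'] -- `shared` is unchanged
print(ChildFilterLALR([0, 1])(shared))  # ['a', 'b'] -- but `shared` is now mutated
```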
@chanicpanic Thanks for looking into it!
Can you explain more about the performance implications?
The `ForestToParseTree` cache and the `ChildFilterLALR` filter are both optimizations that happen to be incompatible with each other. So the question is: which optimization gives the greater performance benefit for Earley with ambiguity resolution?
While the `ForestToParseTree` cache provides a significant speed-up for explicit ambiguous parses, its impact on ambiguity resolution is likely minimal. My hypothesis is that the cache is only relevant when a rule matches an empty string, and even then, the forest nodes have to be visited in a particular order.
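To make that concrete (my own illustration, not code from the library): a nullable rule can derive the empty string at many input positions, which is exactly the kind of repeated sub-result a cache would share.

```python
from lark import Lark

# `a` is nullable: it matches "A" or nothing. Parsing "A" is ambiguous
# (which `a` consumed the token?), and the empty-`a` derivation is the
# kind of forest node the ForestToParseTree cache could share.
l = Lark('''
start: a a
a: "A"?
''')
tree = l.parse("A")
```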
On the other hand, I believe that child filtering is an operation that is performed for many grammars and inputs. So the `ChildFilterLALR` would be used much more often.
Thus, I am leaning toward option 1, because I expect the `ChildFilterLALR` to have the greater overall performance benefit.
@chanicpanic Thanks for the explanation.
I think it's best to make an evidence-based choice, which makes me wonder: do you have an idea for what a good benchmark grammar & input for Earley would be?
I could run benchmarks on trick grammars like `start: start start | "(" start ")" |`, and on normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?
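In case it helps frame the discussion, this is the kind of micro-benchmark I have in mind (a sketch; the input length and repeat count are arbitrary):

```python
import timeit
from lark import Lark

# The trick grammar from above: highly ambiguous, with a nullable alternative.
# Earley and ambiguity="resolve" are Lark's defaults.
parser = Lark('start: start start | "(" start ")" |')

text = "()" * 6  # the number of derivations grows rapidly with input length
print(timeit.timeit(lambda: parser.parse(text), number=10))
```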
cc: @MegaIng
I stumbled across what I think is the same problem. Here is the minimal example from my grammar:
```python
from lark import Lark

GRAMMAR = r'''
start : _s x _s
x : "X"?
_s : " "?
'''
l = Lark(GRAMMAR)

# These behave as expected
print(l.parse(" X ").pretty())
print(l.parse("X").pretty())
print(l.parse(" ").pretty())

# This one produces two x nodes
print(l.parse("").pretty())
```
Output:
```
start
  x

start
  x

start
  x

start
  x
  x
```
> I think it's best to make an evidence-based choice, which makes me wonder: do you have an idea for what a good benchmark grammar & input for Earley would be?
> I could run benchmarks on trick grammars like `start: start start | "(" start ")" |`, and on normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?
@erezsh I don't have any real-world grammars that I think would be a better benchmark for Earley.
For this issue, I think it would be good to use the grammars in this thread, or variations that incorporate the same patterns, which are known to utilize the `ForestToParseTree` cache with `ambiguity="resolve"`. Also, it would be good to benchmark with normal grammars that reflect more typical lark usage.
I made branches for the two options that can be used for comparison:
- https://github.com/chanicpanic/lark/tree/issue1283_no_ftpt_cache
- https://github.com/chanicpanic/lark/tree/issue1283_no_child_filter_lalr
Hi @chanicpanic and everyone,
Sorry it took me this long to look into this!
I did some benchmarks, and my conclusion is that it really doesn't matter.
Both approaches seem to have the same performance in my tests, with less than 5% difference.
So, just choose whatever seems more "correct". I will accept a PR of either approach.