lark
lark copied to clipboard
Inline rules expand ambiguities unnecessarily
Thanks for fixing #1214 earlier. It now seems to return all ambiguities, but ambiguities inside inline rules are expanded within the parser rather than leaving it to CollapseAmbiguities transformer.
I have some input strings that explode very quickly when expanding ambiguities. I parse them with ambiguity=explicit, then count the expected number of expanded ambiguities. If it's within a reasonable threshold, I put the tree through CollapseAmbituities. Otherwise, the processing stops (until I find a better way to deal with such strings). As the result, I still can't use inline rules because parsing of some inputs explodes before I can even count the number of ambiguities.
To reproduce, parse this string:
t = parser.parse('1 2 3')
print(t.pretty())
with this grammar
start: field+
field: f1 | f2
f1: INT
f2: INT
%ignore WS
%import common.WS
%import common.INT
Result:
start
_ambig
field
f1 1
field
f2 1
_ambig
field
f1 2
field
f2 2
_ambig
field
f1 3
field
f2 3
Now, inline field:
start: _field+
_field: f1 | f2
f1: INT
f2: INT
%ignore WS
%import common.WS
%import common.INT
The parser now expands the nested ambiguities prematurely:
_ambig
start
f1 1
f1 2
f1 3
start
f1 1
f1 2
f2 3
start
f1 1
f2 2
f1 3
start
f1 1
f2 2
f2 3
start
f2 1
f1 2
f1 3
start
f2 1
f1 2
f2 3
start
f2 1
f2 2
f1 3
start
f2 1
f2 2
f2 3
@chanicpanic The saga continues!
Sorry, messed up the original examples, corrected now.
The example provided is actually a special case that may be open to optimization. Consider the general case in which expansions of field may contain more than one rule:
start: field+
field: f1 ws | f2 ws2
f1: INT
f2: INT
ws: WS
ws2: WS
%import common.WS
%import common.INT
Parsing the text "1 2 3 " produces:
start
_ambig
field
f1 1
ws
field
f2 1
ws2
_ambig
field
f1 2
ws
field
f2 2
ws2
_ambig
field
f1 3
ws
field
f2 3
ws2
If we inline field in this tree, we get:
start
_ambig
f1 1
ws
f2 1
ws2
_ambig
f1 2
ws
f2 2
ws2
_ambig
f1 3
ws
f2 3
ws2
which is not correct. In this case, I see no more compact way to represent the ambiguities than to expand them all.
We may be able to optimize the special case in which all instances of _field have a single child. However, it is a tradeoff. It would add complexity during tree building for a smaller tree size. On the other hand, expanding the tree is work that may be done later anyways when collapsing ambiguities.
@yurymann Conditional inlining may be your friend for highly ambiguous rules:
start: field+
?field: f1 | f2
f1: INT
f2: INT
%ignore WS
%import common.WS
%import common.INT
gives:
start
_ambig
f1 1
f2 1
_ambig
f1 2
f2 2
_ambig
f1 3
f2 3
You may also want to look at lark-ambig-tools which can help you process highly ambiguous trees more efficiently.
@chanicpanic, wow, conditional inlining looks indeed like THE solution for my case.
And thanks for the lark-ambi-tools link! Disambiguator is particularly a thing I was looking for and was almost ready to write myself.