lark
lark copied to clipboard
Earley: share nodes created by the scanner with the completer
There should only be one distinct SPPF start node now.
When a start rule alternative ends in a terminal, the scanner creates the start SymbolNode, and when a start rule alternative ends in a nonterminal, the completer creates the start SymbolNode. The issue was that if both cases occurred, the scanner would create a start SymbolNode, and then the completer would also create one instead of reusing the existing node because the completer did not know the node existed. Hence, we ended up with two different start SymbolNodes. The issue is resolved by sharing the node_cache from scan with predict_and_complete.
test_multiple_start_solutions and test_consistent_derivation_order1 were adjusted because the change affected the order in which derivations were produced for some grammars. The order is still consistent across executions though.
test_cycles2 is back to having only one derivation which I believe is the correct behavior after the change.
Before the change, the SPPF was:
Notice that the traversal that produces the "triple v" derivation contains two symbol nodes labeled (v, 1, 2, -inf). The existence of two symbol nodes with the same label violates the "Shared" property of SPPFs.
After the change, the SPPF is:
Now, there is only one symbol node labeled (v, 1, 2, -inf). A traversal that produces the 'triple v" derivation still exists, but requires traversing the cycle through (v, 1, 2, -inf) which we don't do.
Thanks for looking into it, @chanicpanic !
In terms of API, it definitely looks like the right approach.
In terms of implementation, it looks like something is a bit off.
In test_consistent_derivation_order1, there should be 4 derivation.
When I run it with explicit ambiguity on this PR, I get this result, which is incorrect and also repetitive:
start
├── _ambig
│ ├── a
│ └── a
│ └── b
└── _ambig
├── a
└── a
└── b
The master branch returns the actual 4 derivations correctly.
@erezsh I believe that result is correct, although I can certainly understand why it may not appear so.
The first _ambig node means that the first child of start may be a or a -> b. Likewise, the second _ambig node means that the second child of start may be a or a -> b. With two options for the first child and two options for the second, we have a total of four derivations.
Using CollapseAmbiguities, we can see that we do in fact end up with four unique derivations:
parser = Lark('''
start: a a
a: "." | b
b: "."
''', ambiguity='explicit')
tree = parser.parse('..')
for t in CollapseAmbiguities().transform(tree):
print(t.pretty())
Output:
start
a
a
start
a
a
b
start
a
b
a
start
a
b
a
b
@chanicpanic Yes, you're right. I apologize for my confusion.
There is just one more thing I'm unsure about. It feels like the ordering is a bit arbitrary. In test_consistent_derivation_order1 the order changed towards the shorter derivation, but in test_multiple_start_solutions it changed towards the longer derivation.
I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.
I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.
This is a very valid concern.
I believe that the new order is more consistent with the way we typically resolve ambiguities. That is, in lieu of priorities, if there is a rule with multiple possible derivations, we choose the derivation based its alternative's rule order. Concretely, if we have rule: a | b and could resolve it with a or b, we choose a because it is defined first.
In the case of test_multiple_start_solutions, the current order gives the derivation for the terminal A first. However, the grammar defines start: a | A, so we would expect the derivation for the rule a first like the new order provides.
For test_consistent_derivation_order1, the old behavior resulted in duplicate symbol nodes for the a rule. The lack of shared symbol nodes for a meant that the ambiguity was pushed up the SPPF. As a consequence, rule order could not be used to choose a derivation, and the selected derivation was nondeterministic when using non-ordered sets. With the new behavior, symbol nodes for a are properly reused, and the ambiguity is represented by the children of those nodes. Hence, given a: "." | b, the derivation using "." is preferred by rule order. As a matter of fact, with rule order in play, the order of derivations produced by the new changes is consistent even without ordered sets (I just realized that means this test no longer tests what it was intended to).
Overall, I think the new behavior is more correct (no duplicate symbol nodes in the SPPF), aligns with rule order expectations even when an alternative ends in a terminal (as in test_multiple_start_solutions), and produces a more predictable order (as in test_consistent_derivation_order1).
For reference, grammars affected by the change contain a rule:
- that has an alternative that ends in a nonterminal
- and an alternative that ends in a terminal
- both of which may produce identical strings
Okay, I'm sufficiently convinced.
Thank you!