lark Earley: share nodes created by the scanner with the completer

trafficstars

There should only be one distinct SPPF start node now.

When a start rule alternative ends in a terminal, the scanner creates the start SymbolNode, and when a start rule alternative ends in a nonterminal, the completer creates the start SymbolNode. The issue was that if both cases occurred, the scanner would create a start SymbolNode, and then the completer would also create one instead of reusing the existing node because the completer did not know the node existed. Hence, we ended up with two different start SymbolNodes. The issue is resolved by sharing the node_cache from scan with predict_and_complete.

test_multiple_start_solutions and test_consistent_derivation_order1 were adjusted because the change affected the order in which derivations were produced for some grammars. The order is still consistent across executions though.

test_cycles2 is back to having only one derivation which I believe is the correct behavior after the change.

Before the change, the SPPF was:

sppfbefore

Notice that the traversal that produces the "triple v" derivation contains two symbol nodes labeled (v, 1, 2, -inf). The existence of two symbol nodes with the same label violates the "Shared" property of SPPFs.

After the change, the SPPF is:

sppfafter

Now, there is only one symbol node labeled (v, 1, 2, -inf). A traversal that produces the 'triple v" derivation still exists, but requires traversing the cycle through (v, 1, 2, -inf) which we don't do.

Aug 17 '24 05:08 chanicpanic

Thanks for looking into it, @chanicpanic !

In terms of API, it definitely looks like the right approach.

In terms of implementation, it looks like something is a bit off.

In test_consistent_derivation_order1, there should be 4 derivation.

When I run it with explicit ambiguity on this PR, I get this result, which is incorrect and also repetitive:

start
├── _ambig
│   ├── a
│   └── a
│       └── b
└── _ambig
    ├── a
    └── a
        └── b

The master branch returns the actual 4 derivations correctly.

Aug 18 '24 14:08 erezsh

@erezsh I believe that result is correct, although I can certainly understand why it may not appear so.

The first _ambig node means that the first child of start may be a or a -> b. Likewise, the second _ambig node means that the second child of start may be a or a -> b. With two options for the first child and two options for the second, we have a total of four derivations.

Using CollapseAmbiguities, we can see that we do in fact end up with four unique derivations:

parser = Lark('''
    start: a a
    a: "." | b
    b: "."
''', ambiguity='explicit')

tree = parser.parse('..')

for t in CollapseAmbiguities().transform(tree):
    print(t.pretty())

Output:

start
  a
  a

start
  a
  a
    b

start
  a
    b
  a

start
  a
    b
  a
    b

Aug 20 '24 01:08 chanicpanic

@chanicpanic Yes, you're right. I apologize for my confusion.

There is just one more thing I'm unsure about. It feels like the ordering is a bit arbitrary. In test_consistent_derivation_order1 the order changed towards the shorter derivation, but in test_multiple_start_solutions it changed towards the longer derivation.

I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.

Aug 20 '24 14:08 erezsh

I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.

This is a very valid concern.

I believe that the new order is more consistent with the way we typically resolve ambiguities. That is, in lieu of priorities, if there is a rule with multiple possible derivations, we choose the derivation based its alternative's rule order. Concretely, if we have rule: a | b and could resolve it with a or b, we choose a because it is defined first.

In the case of test_multiple_start_solutions, the current order gives the derivation for the terminal A first. However, the grammar defines start: a | A, so we would expect the derivation for the rule a first like the new order provides.

For test_consistent_derivation_order1, the old behavior resulted in duplicate symbol nodes for the a rule. The lack of shared symbol nodes for a meant that the ambiguity was pushed up the SPPF. As a consequence, rule order could not be used to choose a derivation, and the selected derivation was nondeterministic when using non-ordered sets. With the new behavior, symbol nodes for a are properly reused, and the ambiguity is represented by the children of those nodes. Hence, given a: "." | b, the derivation using "." is preferred by rule order. As a matter of fact, with rule order in play, the order of derivations produced by the new changes is consistent even without ordered sets (I just realized that means this test no longer tests what it was intended to).

Overall, I think the new behavior is more correct (no duplicate symbol nodes in the SPPF), aligns with rule order expectations even when an alternative ends in a terminal (as in test_multiple_start_solutions), and produces a more predictable order (as in test_consistent_derivation_order1).

For reference, grammars affected by the change contain a rule:

that has an alternative that ends in a nonterminal
and an alternative that ends in a terminal
both of which may produce identical strings

Aug 21 '24 06:08 chanicpanic

Okay, I'm sufficiently convinced.

Thank you!

Aug 30 '24 08:08 erezsh

lark lark copied to clipboard

Earley: share nodes created by the scanner with the completer

lark
lark copied to clipboard