
Adaptive Predict and failing tests

Open bd82 opened this issue 2 years ago • 7 comments

Hi @msujew

I've started looking into the failing tests, specifically this test: https://github.com/Chevrotain/chevrotain/blob/ef3ef96cc08c13210c2109c480544f30f43ef290/packages/chevrotain/test/full_flow/backtracking/backtracking_parser_spec.ts#L72-L90

For reference this is the ambiguous grammar rule being tested

  • https://github.com/Chevrotain/chevrotain/blob/ef3ef96cc08c13210c2109c480544f30f43ef290/packages/chevrotain/test/full_flow/backtracking/backtracking_parser.ts#L93-L138

It seems that regardless of whether I use backtracking, the first (0) alternative is chosen instead of the second (1), despite the fact that it does not match the token sequence.

This raises a few questions:

  1. Is there a bound on the adaptive prediction implemented in Chevrotain?
    • Meaning, would it ever stop after a certain number of tokens?
    • Would you expect adaptive prediction to resolve this grammar ambiguity without any use of backtracking?
    • The LL(*) paper mentions a fallback option to backtracking in its high-level description.
  2. How are gates/predicates implemented in the adaptive prediction algorithm?
    • I have not dived into the implementation yet...
    • Do the predicates get evaluated before / during / after traversing the DFA?
    • Are there performance implications to the manner in which the predicates get evaluated?
    • Could / should predicates be modeled by filtering out the non-viable alternatives (failing gates) in advance and building a unique DFA for each of those alternative subsets?
      • There would be 2^g (g being the number of gates) possible variants.
  3. Does the ambiguity detection logic need to be modified with the introduction of adaptive prediction?
    • It seems that the validation still uses a fixed-k lookahead to determine whether there is an ambiguity.
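To put a number on the "one DFA per gate outcome" idea, here is a small pure-TypeScript sketch (names invented, not Chevrotain internals) of how gate outcomes map to viable-alternative subsets; each distinct subset would need its own cached DFA:

```typescript
// Sketch of the combinatorial cost behind pre-filtering gated alternatives:
// with g gates there are 2^g possible gate-outcome combinations, each
// yielding one viable-alternative subset (and thus one DFA variant).
type Alternative = { index: number; gateId?: number };

function viableSubsets(alternatives: Alternative[], gateCount: number): number[][] {
  const subsets: number[][] = [];
  // Each bitmask encodes one combination of gate outcomes (bit set = gate passes).
  for (let mask = 0; mask < 1 << gateCount; mask++) {
    const viable = alternatives
      .filter((a) => a.gateId === undefined || (mask & (1 << a.gateId)) !== 0)
      .map((a) => a.index);
    subsets.push(viable);
  }
  return subsets;
}
```

For example, two gated alternatives plus one ungated one yield 2^2 = 4 variants, ranging from only the ungated alternative (both gates fail) to all three.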

bd82 avatar Jan 08 '22 18:01 bd82

With a little more debugging, it seems as if the ATN reaches an accepting state upon reaching the : (colon) token, ignoring the FQN part, so it never gets to the truly distinguishing token (equals vs default).

Are invoked subrules not modeled in the ATN (e.g. this.SUBRULE6(this.qualifiedName))?

It also seems as if the predicates:

  • get evaluated after reaching an accepting state
  • may return an alternative that would not match the syntax (token vector); in other words, if there are predicates, they completely disregard the accepting state index detected by the ATN.

https://github.com/Chevrotain/chevrotain/blob/df2a11f7a26f8ecddc6c3a44efcfb112eee865b3/packages/chevrotain/src/parse/grammar/atn_simulator.ts#L78-L82

https://github.com/Chevrotain/chevrotain/blob/df2a11f7a26f8ecddc6c3a44efcfb112eee865b3/packages/chevrotain/src/parse/grammar/atn_simulator.ts#L132-L143

bd82 avatar Jan 08 '22 19:01 bd82

This particular error in the backtracking example is caused by the ambiguity detection. It currently ignores the rule stack, which leads to a detected ambiguity when entering the qualifiedName rule. The algorithm takes subrules correctly into account.

I'm currently on the road, fixing this once I'm back later.

Regarding predicates: we evaluate predicates only at accepting states, since there's no good way to deal with them before that due to the DFA structure. I think this only fails in cases where a longer alternative could be parsed, but due to the predicate it is no longer valid and the shorter alternative should be chosen.
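A rough sketch of that behaviour (plain TypeScript with invented types, not the actual atn_simulator.ts code): the DFA is walked purely on tokens, and predicates are consulted only once an accepting state has been reached, at which point they can override the syntactic prediction:

```typescript
// Simplified model of "predicates are only evaluated at accepting states":
// the walk itself is token-driven; gates come into play only at the end,
// and a passing gate determines the alternative regardless of the
// syntactically computed prediction.
interface DfaState {
  transitions: Map<string, DfaState>;
  isAcceptState: boolean;
  prediction: number; // alternative chosen when no predicates apply
  predicates?: { alt: number; gate: () => boolean }[];
}

function adaptivePredict(start: DfaState, tokens: string[]): number {
  let state = start;
  for (const tok of tokens) {
    if (state.isAcceptState) break;
    const next = state.transitions.get(tok);
    if (next === undefined) return -1; // no viable alternative
    state = next;
  }
  if (!state.isAcceptState) return -1;
  // Only here, at the accepting state, are the gates evaluated. A present
  // predicate list overrides the DFA's own prediction.
  if (state.predicates !== undefined) {
    const passing = state.predicates.find((p) => p.gate());
    return passing !== undefined ? passing.alt : -1;
  }
  return state.prediction;
}
```

This also illustrates the failure mode described above: if the accepting state was reached too early (before the distinguishing token), the predicate result replaces a prediction that was never correct to begin with.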

Does the ambiguity detection logic need to be modified with the introduction of adaptive prediction?

Well, I'm not actually sure how to deal with this, since static ambiguity detection with unbounded lookahead is an undecidable problem. There are research tools out there which attempt to do this, but I haven't looked into them in detail. Antlr4 does not do it statically, and simply defers that task to the dynamic analysis.

msujew avatar Jan 09 '22 10:01 msujew

This particular error in the backtracking example is caused by the ambiguity detection.

I disabled the static validations and still experienced the issue. Are you referring to a different ambiguity detection?

Regarding predicates: we evaluate predicates only at accepting states, since there's no good way to deal with them before that due to the DFA structure. I think this only fails in cases where a longer alternative could be parsed, but due to the predicate it is no longer valid and the shorter alternative should be chosen.

In the line return this.evaluatePredicates(d.predicates), does d.predicates include all the predicates or only a subset of them?

Well, I'm not actually sure how to deal with this, since static ambiguity detection with unbounded lookahead is an undecidable problem. There are research tools out there which attempt to do this, but I haven't looked into them in detail. Antlr4 does not do it statically, and simply defers that task to the dynamic analysis.

Removing the static analysis for ambiguity detection is a valid choice if it no longer fits the lookahead algorithm. But what do you mean by "defers that task to the dynamic analysis"?

bd82 avatar Jan 09 '22 14:01 bd82

Okay I see there is indeed filtering on the predicates that could potentially match the current accepting state:

  • https://github.com/Chevrotain/chevrotain/blob/df2a11f7a26f8ecddc6c3a44efcfb112eee865b3/packages/chevrotain/src/parse/grammar/atn_simulator.ts#L148-L155

bd82 avatar Jan 09 '22 15:01 bd82

I disabled the static validations and still experienced the issue. Are you referring to a different ambiguity detection?

Yes, since static ambiguity detection is not possible, Antlr (and my code as well) employs a heuristic to determine ambiguity during dynamic analysis:

https://github.com/Chevrotain/chevrotain/blob/df2a11f7a26f8ecddc6c3a44efcfb112eee865b3/packages/chevrotain/src/parse/grammar/atn_simulator.ts#L387-L390

Basically, it does the following: first, if every ATNState in our configs is a rule end state, then we definitely know the decision is ambiguous, since we have gone through the whole rule without finding a token that resolves the ambiguity. There is also a second heuristic, which is kind of hard to explain in words, but makes sense in the code linked above.
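The first heuristic could be sketched roughly like this (invented types, not the actual implementation):

```typescript
// Sketch of the "all configurations are in rule stop states" heuristic:
// if lookahead has consumed entire alternatives without finding a
// distinguishing token, and more than one alternative is still viable,
// the decision is reported as ambiguous.
interface AtnConfig {
  state: { type: "RULE_END" | "BASIC" };
  alt: number; // the alternative this configuration belongs to
}

function allConfigsInRuleStopStates(configs: AtnConfig[]): boolean {
  return configs.every((c) => c.state.type === "RULE_END");
}

function detectedAmbiguity(configs: AtnConfig[]): boolean {
  // Ambiguity requires at least two competing alternatives.
  const alts = new Set(configs.map((c) => c.alt));
  return alts.size > 1 && allConfigsInRuleStopStates(configs);
}
```

If even one configuration still sits in a non-final state, some future token might yet disambiguate, so the simulation keeps going.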

The concrete implementation of this heuristic differs a bit from the original Antlr implementation. I first implemented it exactly like Antlr does and ran into the next issue: Antlr's heuristic is too strict and does not correctly identify the ambiguities in the ECMA5 grammar, parsing almost all of the input before determining that it is ambiguous. While it fixes the test case, the ECMA5 parser now doesn't work anymore :/

I also just retested the example using a parser generated by Antlr. It needs 500 tokens of lookahead before it determines that the alternative is indeed ambiguous. It is also incredibly slow. I think we need some other approach to ambiguity detection.

msujew avatar Jan 09 '22 15:01 msujew

@bd82 I thought about this issue for a few days now. I don't really see a way to reliably fix the ambiguity issue for the ECMA5 parser. Instead I would propose the following compromise:

We use the LL(*) lookahead algorithm as the default lookahead function. However, when the maxLookahead variable is set either on the grammar level or on the alternative level, we switch to the LL(k) lookahead for the whole grammar or the alternative respectively.

Pros:

  • Full backwards compatibility for users which already use maxLookahead on a grammar level
  • Predictable ambiguity resolving for edge cases in which the heuristic fails
  • Ambiguity detection for LL(k) grammar decisions
  • No additional flags on grammar/alternative level needed

Cons:

  • Higher maintenance cost and general complexity, since both algorithms need to be supported
  • More documentation effort to make the distinction between both lookahead algorithms clear
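The proposed dispatch could be sketched as follows (hypothetical names; this shows only the selection logic, not the lookahead algorithms themselves):

```typescript
// Sketch of the proposed compromise: LL(*) is the default, and an explicit
// maxLookahead, either parser-wide or on an individual alternation,
// switches that decision back to bounded LL(k). The more specific
// per-alternative setting wins over the grammar-wide one.
interface LookaheadOptions {
  grammarMaxLookahead?: number; // parser-level maxLookahead, if set
  alternativeMaxLookahead?: number; // per-alternation MAX_LOOKAHEAD, if set
}

type Strategy = { kind: "LL_STAR" } | { kind: "LL_K"; k: number };

function chooseLookaheadStrategy(opts: LookaheadOptions): Strategy {
  const k = opts.alternativeMaxLookahead ?? opts.grammarMaxLookahead;
  return k === undefined ? { kind: "LL_STAR" } : { kind: "LL_K", k };
}
```

This keeps existing maxLookahead users on the LL(k) behaviour they already rely on, while everyone else gets LL(*) without touching their grammar.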

What's your opinion on that? By the way, for simpler communication, I sent you a friend request on Discord.

msujew avatar Jan 14 '22 13:01 msujew

Feel free to merge PRs to the master-ll-star branch, as it will likely be a while before I have the time to dive into the LL(*) topic.

bd82 avatar Jan 31 '22 11:01 bd82