
Recovery logic affects parsing of valid input

Open · jvanstraten opened this issue · 15 comments

I would expect that adding recovery logic to a grammar doesn't functionally change the grammar, but currently it does: repeated() and or_not() (at least), when their inner parser can recover, will greedily consume the recovered input, even though the recovered error may have occurred only because the input for or_not() simply wasn't there, or because there was one fewer repetition for repeated(). Surprisingly, or() does work as I'd expect, so x.or_not() differs in behavior from x.map(Some).or(empty().map(|_| None)).

Here are some failing test cases to illustrate. All the provided inputs are valid and should therefore parse successfully (return Ok), which they do when you remove the recover_with() calls.

#[test]
fn recovery_or_not() {
    let parser = just::<_, _, Simple<char>>('x')
        .then(just('y'))
        .recover_with(skip_until(['y'], |_| Default::default()).consume_end())
        .or_not()
        .then(just('z'));
    assert_eq!(parser.parse("xyz"), Ok((Some(('x', 'y')), 'z')));
    assert_eq!(parser.parse("z"), Ok((None, 'z'))); // FAILS
}

#[test]
fn recovery_or() {
    let parser = just::<_, _, Simple<char>>('x')
        .then(just('y'))
        .recover_with(skip_until(['y'], |_| Default::default()).consume_end())
        .or(just('a')
            .then(just('b'))
            .recover_with(skip_until(['b'], |_| Default::default()).consume_end()))
        .then(just('z'));
    assert_eq!(parser.parse("xyz"), Ok((('x', 'y'), 'z')));
    assert_eq!(parser.parse("abz"), Ok((('a', 'b'), 'z')));
}

#[test]
fn recovery_repeated() {
    let parser = just::<_, _, Simple<char>>('x')
        .then(just('y'))
        .recover_with(skip_until(['y'], |_| Default::default()).consume_end())
        .repeated()
        .at_most(2)
        .then(just('z'));
    assert_eq!(parser.parse("xyxyz"), Ok((vec![('x', 'y'), ('x', 'y')], 'z')));
    assert_eq!(parser.parse("xyz"), Ok((vec![('x', 'y')], 'z'))); // FAILS
    assert_eq!(parser.parse("z"), Ok((vec![], 'z'))); // FAILS
}

If this is unintended (presumably at least the inconsistency is), I'm pretty sure I can figure out how to fix it and submit a PR.

jvanstraten · Jun 13 '22 16:06

Unfortunately, this is a little difficult to fix in the general case, because the number of potential alternative parse paths blows up exponentially (we don't just need to backtrack, but also to enumerate all possible alternative lengths of input accepted by repeated, separated_by, etc.). It's also mechanically quite complicated and would radically increase the complexity of the crate. The docs do point to this issue, but perhaps not clearly enough.

You are right that the situation could be improved for specific parsers like or_not, although the logic for or is currently quite convoluted and would probably need generalising to cover such cases. You're welcome to have a go at improving this (although it might be best to work from the new zero-copy branch, because it's a near-total rewrite of the crate).

As mentioned elsewhere in the docs, parser recovery is not a silver bullet and I strongly recommend using more specific recovery strategies where possible to avoid this problem (nested_delimiters is powerful enough to be useful in a surprising number of cases while also being very specific about the input it will accept).
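
For a concrete sketch of what that can look like on the same 0.x API as the tests above (the Expr type, its Invalid variant, and the toy grammar here are made up for the example):

use chumsky::prelude::*;

#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Num(u64),
    List(Vec<Expr>),
    Invalid, // produced by the recovery fallback below
}

// A bracketed list of integers. If the list body is malformed, recovery skips
// to the ']' matching the '[' we already consumed (respecting nesting of the
// other delimiter pairs) and yields Expr::Invalid.
fn list() -> impl Parser<char, Expr, Error = Simple<char>> {
    let num = text::int(10).map(|s: String| Expr::Num(s.parse().unwrap()));
    num.padded()
        .separated_by(just(','))
        .delimited_by(just('['), just(']'))
        .map(Expr::List)
        .recover_with(nested_delimiters('[', ']', [('(', ')')], |_| Expr::Invalid))
}

Because the strategy only fires between an already-matched '[' and its balancing ']', it can't run off and consume input that belongs to whatever parser comes next, which is exactly the failure mode of skip_until(...).consume_end() in the tests above.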

zesterer · Jun 13 '22 16:06

When I was hacking away at my own parser stuff (before giving up and trying Chumsky instead), my approach for recovery for alternations was:

  • for each alternative:
    • parse alternative without enabling recovery:
      • if successful, return its result
      • otherwise, record the number of tokens consumed before the parser got stuck, and continue
  • (input is invalid)
  • if recovery is enabled:
    • for each alternative, in descending order of tokens consumed:
      • parse alternative with recovery enabled:
        • if recovery successful, return its recovered result
        • otherwise, continue
    • (input is invalid and recovery failed)
  • return failure

This isn't immediately applicable to Chumsky (my parsers could only return success or failure, enabling recovery was a flag, and recovered errors would be pushed to a vector of errors rather than returned), but I think it could be adjusted to work for Chumsky too by buffering the recovered parse trees rather than redoing the parsing (though I'm not sure how to query the number of tokens matched without errors). The logic generalizes trivially to greedy repetitions and concatenations by not backtracking after each loop iteration, and x.or_not() is just x.or(empty()). The root parser is normally initialized with recovery enabled.
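
Schematically, and definitely not in Chumsky's (or my crate's) API — Alt, Outcome, and Failure below are invented stand-ins, with error reporting left out — the alternation logic looked something like this:

// Invented stand-ins for illustration only.
struct Failure {
    consumed: usize, // tokens consumed before this alternative got stuck
}

enum Outcome<T> {
    Ok(T),
    Err(Failure),
}

trait Alt<T> {
    fn parse(&self, input: &[char], recover: bool) -> Outcome<T>;
}

fn parse_alternation<T>(alts: &[&dyn Alt<T>], input: &[char]) -> Option<T> {
    // Pass 1: try every alternative with recovery disabled; the first success
    // wins, so valid input never touches the recovery machinery at all.
    let mut failures = Vec::new();
    for (i, alt) in alts.iter().enumerate() {
        match alt.parse(input, false) {
            Outcome::Ok(out) => return Some(out),
            Outcome::Err(f) => failures.push((i, f.consumed)),
        }
    }

    // Pass 2 (the input is invalid): retry with recovery enabled, preferring
    // the alternative that got furthest before failing.
    failures.sort_by(|a, b| b.1.cmp(&a.1));
    for (i, _) in failures {
        if let Outcome::Ok(out) = alts[i].parse(input, true) {
            return Some(out);
        }
    }

    None // the input is invalid and recovery failed too
}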

I'm not sure how or why this would explode, at least not for valid input; in fact, valid input never hits the recovery code at all. For invalid input it might be slower than what Chumsky currently does; I'm not sure. The algorithm is nontrivial when you write it out, sure, but...

Writing parsers with good error recovery is conceptually difficult and time-consuming. It requires understanding the intricacies of the recursive descent algorithm, and then implementing recovery strategies on top of it.

... that's why people use parser libraries rather than reinventing the wheel :)

parser recovery is not a silver bullet and I strongly recommend using more specific recovery strategies where possible to avoid this problem

I interpreted that line in the docs to mean that the quality of the error messages degrades when you spam recovery logic without thinking about it, which I understand; those test cases are just test cases, not my idea of idiomatic parsers. I didn't expect it to start rejecting valid input without a much bigger warning and an explanation of why, though...

FWIW, I ran into this because, for the project I'm working on, I want to parse a very dumb, documentation-focused, as-close-to-"standard"-EBNF-as-possible grammar into best-effort (but correct) parser code for various frameworks. The problem I'm trying to solve is mostly to give the documentation/specification of a DSL a single point of truth from which parser implementations in various languages are generated; I want to target at least ANTLR and something in stable Rust, and maybe I'll see how bad things get if I try to throw flex/bison (which is LALR rather than LL/recursive descent) into the mix, since I've used that before. So, indeed, my generator does currently spam recovery logic rather mindlessly. I do want it to use nested_delimiters() where possible, but either my generator has to somehow heuristically detect when that applies, or I'll have to provide it with additional Chumsky-specific information, which I'd like to avoid.

jvanstraten · Jun 13 '22 17:06

Ah,

Because chumsky is a PEG parser, which always takes the first successful parsing route through a grammar, recovering from an error may cause the parser to erroneously miss alternative valid routes through the grammar that do not generate recoverable errors. If you run into cases where valid syntax fails to parse without errors, this might be happening: consider removing error recovery or switching to a more specific error recovery strategy.

Must have glossed over that note a gazillion times. That's clear enough IMO; my bad. (I still don't think it should do this, though.)

jvanstraten · Jun 13 '22 17:06

my approach for recovery for alternations was:

This sounds like an interesting approach. Do you think it would be possible to mock up a trivial example of it in practice, perhaps with a few basic combinators (just, then, or, repeated)? If it's feasible to implement this reliably and fairly performantly, I'd definitely like to!

zesterer · Jun 13 '22 17:06

The primary reason I gave up writing my own library is that I confused myself with all the type/template stuff involved with combinators, but I'll give it another shot now that I'm more used to it from working with Chumsky. I might even be able to just refactor Chumsky to implement (a proof of concept of) it.

jvanstraten · Jun 13 '22 18:06

Okay, I went a little bit nuts and basically just started from scratch. Notably, I handled source location tracking a bit more generically, because I found myself fighting Chumsky a bit in that regard. I have exactly zero tests thus far, my lifetimes might not be correct everywhere yet (it's zero-copy, though), and I haven't ported every single combinator that exists in Chumsky yet, but it's pretty complete now and it's really time to go to bed: https://github.com/jvanstraten/parser-proof-of-concept

I'll probably add some tests and chase bugs tomorrow.

jvanstraten · Jun 14 '22 20:06

Thanks so much for this! I'll hopefully have time to take a deeper look next week.

zesterer · Jun 17 '22 09:06

It's not completely done and usable yet; I realized earlier this week that I had forgotten to actually implement recovers_with(), so there wasn't really a way to trigger the recovery logic (my excuse is that it was late), and I've since nerd-sniped my way into implementing recovery using combinators rather than having to rely on the three recovery methods currently built into Chumsky. I know more or less how I want to do it, but I'm still having a hard time writing it down in a way that lets Rust derive all the generics on the traits. I reverted that stuff on master just now so that it'll build and all tests will pass, but I'll continue on my recovery-combinators branch.

I do have basic test cases for the individual primitives and combinators now, though. They all seem to work fine.

Anyway, let me know which parts you like and which you don't when you have time to look into it. It's starting to look like a parsing crate in its own right, but my intention is still to help get things (conceptually) merged into Chumsky.

jvanstraten · Jun 17 '22 15:06

FWIW, a more generic approach to error recovery, probably utilising the existing parsing system, is desired and will probably be coming in zero-copy (see #77), so it sounds like this is in line with that. Unfortunately, I'm away this weekend so I'll need to wait until Monday or Tuesday to check this out.

zesterer · Jun 17 '22 15:06

I got recovery combinator type derivation working by adding a random PhantomData; I'm not sure exactly why that made it work, but having previously refactored my entire codebase three or so times only to receive a new error message each time, I'm not going to question it. I think all that's left is to implement Clone for all the combinators, and then I should be able to generate a more serious test case.
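
(For the record, the usual reason a stray PhantomData fixes this kind of thing is an otherwise-unused type parameter on the combinator struct. An invented example, not my actual code:)

use std::marker::PhantomData;

// Without the PhantomData field, rustc rejects the definition outright
// (E0392: "parameter `O` is never used"), even though `O` is needed to
// write the parser trait impl for the combinator.
struct Recover<C, S, O> {
    child: C,
    strategy: S,
    _output: PhantomData<O>,
}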

jvanstraten · Jun 20 '22 13:06

Okay, I think I'm pretty much done now. I ran into a bunch of ownership issues when trying to implement something half-serious, so it took me longer than I hoped it would. I now have a handwritten JSON parser and my absolutely terrible bootstrapping code for that EBNF-based generator I'm working on as test cases, and both seem to work.

It does appear to be a bit slower than Chumsky: on my machine, Chumsky's benchmark takes 4.94 ms to parse the benchmark JSON file, my parser takes 8.56 ms, and pom takes 15.5 ms. I'm kinda hoping that's just down to how I wrote the grammar and cut corners here and there while taking the code-fast approach, though. It's not even hitting the recovery logic, so I don't see why it would be almost twice as slow from an algorithmic point of view.

I've also gotten myself seriously confused about zero-copy stuff in the past few days. I started out by hardcoding references to the input type all over the place, but mostly that just made a mess and made it very difficult to deal with Copy things like char. Then at some point I realized I could just remove all those references and bind a reference type to I when I don't want the input tokens to be cloned, and get pretty much exactly the same code as I would have gotten otherwise... so I did, and it works fine. So now I'm confused why that wouldn't also apply to Chumsky? Is there something preventing the current version from being used that way, or am I completely misunderstanding what zero-copy means in this context?

jvanstraten · Jun 22 '22 13:06

Then at some point I realized I could just remove all those references and bind a reference type to I when I don't want the input tokens to be cloned, and get pretty much exactly the same code as I would have gotten otherwise... so I did, and it works fine. So now I'm confused why that wouldn't also apply to Chumsky?

Chumsky works fine on tokens that contain references. However, the main use-case for zero-copy parsing is the ability to refer back to long tracts of the input stream (i.e. parse a byte buffer and have a &[u8] in the output that refers back to data in the buffer), something that master does not currently support.
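
Purely to illustrate the idea (this is hand-rolled, not chumsky's API): the point is that an output like the &str below borrows straight from the input buffer rather than being collected into a fresh String.

// Take a leading identifier and return it as a borrow of `input`, along with
// the remaining input. The matched text is never copied or allocated.
fn take_ident(input: &str) -> Option<(&str, &str)> {
    let end = input
        .char_indices()
        .find(|&(_, c)| !c.is_alphanumeric() && c != '_')
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some((&input[..end], &input[end..]))
    }
}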

Have you taken a look at https://github.com/zesterer/chumsky/pull/82? It's basically a from-scratch redesign of the crate with support for zero-copy parsing in the way I described above (and a lot of other things). Performance is also much improved!

Unfortunately, I'm going through a period of burnout with things right now, so I'm finding it tough to make progress. I'm sure this will have cleared up in a few weeks. There's enough there to play with, though, despite error recovery logic being missing right now. I'm also pretty pleased with the quality of the code; it's substantially cleaner than master.

Perhaps your approach to error recovery could become the new strategy for zero-copy in time.

zesterer · Jun 23 '22 14:06

Chumsky works fine on tokens that contain references. However, the main use-case for zero-copy parsing is the ability to refer back to long tracts of the input stream (i.e. parse a byte buffer and have a &[u8] in the output that refers back to data in the buffer), something that master does not currently support.

I see. So rather than a Vec of tokens (even if those tokens are actually just references) you'd always get a slice directly. It sounds to me like that implies that the input always has to be a slice, though? What if you want to stream something or use owned tokens? Is that just not supported at that point, or is there some magic that somehow allows both methods?

Have you taken a look at #82?

I have to admit I haven't in great detail. I assumed that, aside from it being zero-copy, it would work more-or-less the same. Sounds like it's a lot more different than I thought.

Unfortunately, I'm going through a period of burnout with things right now

Sorry to hear that :(

FWIW, it sounds like I'm going to be working on other projects soon-ish. I stand by the idea that a universal-ish DSL specification language is a good idea, so maybe I'll pick it back up at some point, but it was always pretty far from the stuff I actually should be working on. So no rush from me.

I'm sure there's more to it than Chumsky, but keep in mind:

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND

You don't owe anyone here anything by taking some time for yourself ;)

jvanstraten · Jun 23 '22 16:06

It sounds to me like that implies that the input always has to be a slice, though? What if you want to stream something or use owned tokens?

If you want to grab a slice, yes. The API has changed in the zero-copy branch to accommodate this and allows abstracting over slice-like things with the Input trait. It's still possible to use an arbitrary iterator (like Stream does on master), but you lose the ability to pull slices out of it, as you might expect. Thankfully, traits let us abstract over both cases!
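
Very roughly, the layering looks something like this (invented trait names, not the actual Input trait from the branch; just the shape of the idea):

use std::ops::Range;

// Every input can yield tokens one at a time, much like Stream on master.
trait TokenInput {
    type Token;
    fn next_token(&mut self) -> Option<Self::Token>;
}

// Only slice-like inputs (e.g. &str, &[T]) can additionally hand out
// sub-slices, which is what makes zero-copy outputs possible.
trait SliceInput: TokenInput {
    type Slice: ?Sized;
    fn slice(&self, range: Range<usize>) -> &Self::Slice;
}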

I stand by the idea that a universal-ish DSL specification language is a good idea, so maybe I'll pick it back up at some point, but it was always pretty far from the stuff I actually should be working on. So no rush from me.

That's fair enough. Thanks very much for writing that example implementation, I'll definitely be referring to it when I go about designing chumsky's new error recovery system.

You don't owe anyone here anything by taking some time for yourself ;)

Thanks. I frequently forget this fact, often to my detriment :grin:

zesterer · Jun 23 '22 18:06

Fleshing out zero-copy's error recovery system is one of the biggest things yet to do, so I'm tagging this accordingly.

zesterer · Feb 20 '23 22:02