rust-peg Add error recovery

Add error recovery

Open kw217 opened this issue 2 years ago • 7 comments

When using a parser in the context of an IDE or a compiler, it is desirable to be able to recover from errors, returning a best-efforts parse anyway and reporting all errors rather than just the first. This PR extends rust-peg with support for this, based on Syntax Error Recovery in Parsing Expression Grammars and Automatic Syntax Error Reporting and Recovery in Parsing Expression Grammars.

@kevinmehall I'm not sure if multiple errors is something you want to support, and I'm not 100% confident I've chosen the best syntax, so I would be grateful for your feedback.

Adding error recovery necessarily introduces a breaking change - the public return type of a parser has to be able to include multiple errors. To ease upgrade, the new type has an .into_result() method which returns the original result type.

There are three expressions for raising an error:

error!{ "message" e2 } is the simplest form. It reports an error with the given message at the current location, then attempts to recover by parsing e2 instead.
error_if!{ e1 | "message" e2 } allows the error to be reported at an earlier location. It attempts to parse e1; if it succeeds it reports an error with the given message at the start of e1, then attempts to recover by parsing e2 instead.
error_unless!{ e1 | "message" e2 } is shorthand for a choice expression. It is useful in order to report an error when an expression fails to match. It attempts to parse e1; if it fails it reports an error with the given message at the current location, then attempts to recover by parsing e2 instead.

The result of the parser (ParseResults) contains the final value, failure, or error, along with a list of all errors we recovered from during the parse.

Full documentation is provided in src/lib.rs.

Changes

The new expression forms are represented in the AST by RecoverIfExpr and RecoverUnlessExpr (the simple error! form is sugar for error_if! with an empty e1).
Most test and example code has been upgraded by inserting .into_result().
New tests are added as recovery.rs (the example from the paper) and arithmetic_recovery.rs (a worked example of how to use the power of the new features).
In the parser, RuleResult now has a third alternative (beyond Matched and Failed): Error. This is handled largely following the rules from the papers cited, although a few improvements and optimizations have been added as noted in the comment in src/lib.rs.
Similarly ErrorState gains a set of errors recovered from so far.
I made a few renamings for clarity, since now "error" and "failure" are two different things - e.g., into_parse_error is now into_failure, since it creates a failure not an error, and reparsing_on_error becomes reparsing_on_failure.
In a few places in the code I added comments explaining the meaning of fields and methods (as I figured them out :-) ).

Let me know if you have any questions!

Feb 15 '22 23:02 kw217

It is quite big improvement and I think it will be better to split it on several commits since otherwise it will be difficult to review.

I suppose to split this PR to following commits:

Add documentation to existing stuff
Introduce new return type and upgrade library to use it + migration guide
Introduce new grammar constructs + AST upgrade

Maybe some steps can be split even more, but that is a required minimum as I think.

Feb 16 '22 04:02 Mingun

Thanks @Mingun , I'll split up the PR into separate commits as you suggest.

I've fixed the bug in trace mode that was spotted by the automated tests (oops!).

I just noticed the discussion at #276, which I hadn't seen before I submitted this PR (I did most of the work a few months ago). Here are some thoughts against each of @kevinmehall's headings:

Producing errors as part of the normal result is the intended way of using this extension; however this extension allows you to record those errors too.
I love the idea of checking the parser can't fail. That's not addressed here.
Error localisation - this extension (following the paper) collects the points at which you threw a labelled error, rather than re-running the existing reparsing algorithm.
Labeled failures - that's the paper I based this work on!
Lossless syntax tree. This isn't addressed here either (though some principled way of capturing location/span info is clearly needed, and most parsers will reinvent it somehow).

Feb 16 '22 20:02 kw217

(apologies for the noise)

This is now split into five commits - @Mingun hopefully this is easier to review!

Add extra comments.
Expose (existing) parser return type, and require .into_result().
Bootstrap grammar.rs.
Add error recovery.
Bootstrap grammar.rs.

This is now ready for review. Thanks.

Feb 16 '22 23:02 kw217

Yes, some kind of error recovery is definitely in-scope for rust-peg and I'm glad to see someone working on this. Error recovery is potentially a big addition and I want to make sure we've explored the design space and get it right, so this is going to take some discussion and iteration.

Some initial feedback:

Syntax

It seems that

error_if! { e1 | "message" e2 } is the same as &e1 error!{ "message" e2 }
error_unless!{ e1 | "message" e2 } is the same as e1 / error!{ "message" e2 }

I would lean towards re-using the existing control flow operators unless there's some advantage of duplicating them.

Collecting multiple errors

I'm wary of anything that returns data out of a PEG expression except via its return value, because this doesn't play well with backtracking.

For example,

rule abc() = "a" error_unless!{ "b" | "expected b" [_] }  "c"
rule asdf() = "asdf"
pub rule top() = abc() / asdf()

top("asdf") produces an "expected b" error in its errors list, even though the otherwise-successful parse matched using asdf() instead of abc().

The Medeiros paper doesn't say much about how errors should be collected. You could imagine reverting recorded errors on backtracking, but this seems like it would have a large performance penalty to implement.

This is why on #276 I proposed returning the errors as part of the AST -- you have to construct a placeholder AST node anyway, so why not make it carry the error? That also leaves the error type up to the user, instead of requiring &'static str like here, and avoids complicating the parser API.

If you really want to collect a list of errors without dealing with backtracking, you can always pass in a mutable Vec with the existing support for grammar arguments.

Error vs Failure

What constructs ParseResult::Error? It seems like this is only ever produced if recovery expression e2 fails? Giving error!{} the dual purpose of logging an error and continuing vs aborting the rest of the parse depending on the presence and success of the recovery expression seems like it's unnecessarily combining two pieces of functionality that could be orthogonal.

I also question whether the behavior of preventing further backtracking and aborting the parse is desirable here if the goal is to build an AST that is as complete as possible. You could imagine wanting layered error recovery, e.g. attempting recovery within the rule for expressions, but if that fails, the calling rule tries to recover at the following statement, and if nothing else, you replace the whole function with an error node but continue with the rest of the file.

I do support being more consistent on "error" vs "failure" terminology in the docs and code, though!

I want to re-read the papers more closely, because I feel like I'm missing something there -- the most useful piece from the first paper seems to be e.g. the strategy of the SkipToRCUR rule, but that can be done in any PEG implementation using ordered choice without any special features. Medeiros describes the error handling in terms of labeled failures, but the paper doesn't make much use of that connection -- it doesn't use the extension of the ordered choice operator to catch failures with a particular label. The labels are merely a point to attach a recovery expression, which as your implementation shows, can be done inline instead of with labels, and I suggest is basically the same as falling back to a recovery expression with ordered choice anyway.

So I'm not sure where that leaves this PR. It's been interesting to try out, at least. If you have a grammar that you've used this on, I'd like to see it.

Feb 19 '22 22:02 kevinmehall

Thanks, really appreciate the detailed response, and definitely keen to iterate. This is a hobby project for me so I apologise in advance if it takes me a while to get back to you.

This was a fairly straightforward implementation of the paper, without taking a lot of time to consider whether it was the best approach. You are correct re ParseResult::Error. You ask very sensible questions about backtracking and collection of errors, and I'm intrigued by the idea of separating the logging and recovery/abort. I'll ponder how this might work.

I see your point about reporting errors in the AST, but having them on the side makes sense in terms of the way a compiler or IDE expects to see them - when I compile a Rust program I am presented with a linear list of errors, and in order to generate wiggly lines in an IDE I have to present a list of source ranges. This could be extracted from the AST, but that adds an extra stage which feels artificial.

You are totally right that error_if! and error_unless! are just sugar. However when you actually write a grammar they represent patterns that come up a lot. error_unless! is already in the Medeiros paper, and I found I needed error_if! in order to get errors to show up in the right place - and the repetition of &e1 error!("message" e1 e2) was unpleasant, especially if e1 was complex. See the example arithmetic_recovery.rs for how this is used. The syntax may not be ideal, but I think something like this is necessary for ergonomic reasons.

The best example I have at the moment is arithmetic_recovery.rs; I have another project that inspired this but it's not ready for sharing just yet, which provides syntax highlighting and error reporting for the Monaco editor (the editor used by VS Code) via WASM.

Thanks again. I'll get back to you with a better example of how this can be used, and I'll ponder more broadly whether there's a better approach.

Feb 20 '22 22:02 kw217

Looking at the current mechanism, error recovery is not moved to the lowest precedence alt, which can result in ambiguous matches. So that for example if we have a rule constant and expression , for a gplang, if we match a integer constant as show below, only integer literals are matched.

rule integer() -> Literal
  = val:$(['0'..°'9') ....
  / error!{"nan" 0}

pub rule literal() -> Literal
  = integer
  / floating
  / string
  / error!{"expected a constant value" 0}

If we look at ANTLR, the parser first tries all possible alternatives, then exports the parse trace to a error handler.

ANTLR would produce a state machine like so:

const -> integer>{v} | floating>{v} | string>{v}
expr -> const | interpolation | stmt
unit -> expr | expr>(const>integer>{"nan", 0} | expr>const>{"expected a constant value" 0})

(note that all error recovery is done at the highest level of the tree in the state machine) while the error handling would produce a "state machine" like so:

integer -> [0..=9] {v} | {"nan" 0}. //this can never fail
const -> integer>{v} | floating>{v} | string>{v} | 
expr -> const | interpolation | stmt | {"expected a constant value" 0})
unit -> expr

which is obviously invalid.

Nov 18 '22 18:11 ProphetLamb

Bump Any updates on this? 👀

Nov 06 '23 17:11 DasLixou

rust-peg rust-peg copied to clipboard

Add error recovery

Changes

Syntax

Collecting multiple errors

Error vs Failure

rust-peg
rust-peg copied to clipboard