Plans for error recovery?
It looks like there's been some work done in extending PEG generators to support error recovery (producing a syntax tree even in the face of some syntax errors) -- see this blog post and this paper. Have y'all thought about supporting something like that in peggy?
Do you mean adding fail-safe parsing to the Peggy grammar itself, or adding the ability to generate fail-safe parsers? If the latter, I don't think we need any special syntax for that, because the existing syntax is already quite straightforward. The example from the blog post, written in Peggy, would look like this:
{
  let errors = [];
  function report(message, loc = location()) {
    errors.push({
      location: loc.start.offset + ".." + loc.end.offset,
      message,
    });
  }
}
source_file = ast:failsafe_expr EOF .* {
  return { ast, errors };
};
failsafe_expr = expr / '' {};
expr = paren / ident / error;
ident = $([a-zA-Z_] [a-zA-Z0-9_']*);
paren
  = '('
    e:(expr / '' { report("expected expression after `(`"); })
    (')' / '' { report("missing `)`"); })
    { return [e]; }
  ;
error = err:$(!')' .)+ { report("unexpected `" + err + "`"); };
EOF = !. / '' { report("expected EOF"); };
Here, undefined corresponds to the Error variant, an array to the Parens variant, and a string to the Ident variant.
The error spans for (foo)) and () are not the same as in the blog post, but I think the ones in the Peggy variant are more logical.
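To make the result shape concrete, here is a hand-written JavaScript approximation of the grammar above. It is only a sketch of the "report and keep going" idea, not what Peggy generates; `parseFailsafe` and `FAIL` are names made up for the illustration.

```javascript
// Hand-written approximation of the grammar above, not Peggy's generated output.
// `parseFailsafe` and `FAIL` are names made up for this sketch.
const FAIL = Symbol("fail");

function parseFailsafe(input) {
  const errors = [];
  let pos = 0;

  // Record a problem and keep parsing instead of throwing.
  function report(message, start = pos, end = pos) {
    errors.push({ location: start + ".." + end, message });
  }

  // ident = $([a-zA-Z_] [a-zA-Z0-9_']*)
  function ident() {
    const re = /[a-zA-Z_][a-zA-Z0-9_']*/y;
    re.lastIndex = pos;
    const m = re.exec(input);
    if (m === null) return FAIL;
    pos += m[0].length;
    return m[0];
  }

  // paren = '(' (expr / report) (')' / report)
  function paren() {
    if (input[pos] !== "(") return FAIL;
    pos += 1;
    let e = expr();
    if (e === FAIL) {
      report("expected expression after `(`");
      e = undefined;
    }
    if (input[pos] === ")") pos += 1;
    else report("missing `)`");
    return [e];
  }

  // error = $(!')' .)+ : swallow everything up to the next `)`
  function error() {
    const start = pos;
    while (pos < input.length && input[pos] !== ")") pos += 1;
    if (pos === start) return FAIL;
    report("unexpected `" + input.slice(start, pos) + "`", start, pos);
    return undefined;
  }

  // expr = paren / ident / error (ordered choice)
  function expr() {
    let r = paren();
    if (r === FAIL) r = ident();
    if (r === FAIL) r = error();
    return r;
  }

  // source_file = failsafe_expr EOF .*
  let ast = expr();
  if (ast === FAIL) ast = undefined;
  if (pos < input.length) report("expected EOF");
  return { ast, errors };
}
```

With this sketch, `parseFailsafe("(foo")` still yields the tree `["foo"]`, plus a single error with location `4..4` and message "missing `)`", and `parseFailsafe("(foo))")` yields the same tree with an "expected EOF" error at `5..5`.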
@Mingun thanks for the tip, but I've found that applying this approach to a more complex grammar becomes very cumbersome and hard to keep track of. I wonder if there is any interest in providing a more "batteries included" solution for those of us parser hobbyists who need to write parsers that work well with IntelliSense, for example.
I understand what you want, but what would you suggest improving in the Peggy grammar? You need to specify the synchronization points in your grammar manually anyway, whether with dedicated syntax or with the existing one. Because the existing features do not create much overhead, I do not see what benefit we would get from additional syntax. Of course, if you can propose a syntax for that and show the benefits of implementing these concepts automatically, I won't be against it. But right now I do not see those benefits.
The same paper is on arxiv.org (unlike dl.acm.org, it is available for free there). The authors also note that manually written recovery rules can give more precise error recovery.
However, I think you can add a syntax that simplifies the creation of nodes for automatic recovery, for example:
expression
%"<error message>"
would be translated to:
expression / '' { report("<error message>"); }
where report is a new API function for registering parser errors, similar to expected()/error() except that it does not stop parsing.
Then the original grammar
expr = paren / ident;
ident = $[a-z]i+;
paren = '(' expr ')';
could, with minimal changes, be converted to a grammar with some error recovery mechanisms:
expr = paren / ident;
ident = $[a-z]i+;
paren
  = '('
    expr % "expected expression after `(`"
    ')' % "missing `)`"
  ;
The modified grammar is able to produce a result for incomplete inputs such as (foo (instead of (foo)), but inputs with excess symbols (such as (foo))) require additional rules for error recovery.
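As a rough illustration of the translation, the proposed `%` annotation can be expanded into syntax Peggy already accepts. The toy rewriter below is only a sketch: `desugarRecovery` is a made-up name, and it assumes each annotated item is a single token on its own line, as in the example above. Note that the expansion has to be parenthesized, because choice (`/`) binds more loosely than sequence.

```javascript
// Toy rewriter for the proposed `%` annotation; `desugarRecovery` is a made-up name.
// Assumption: each annotated item is a single token on its own line, as in the example.
function desugarRecovery(grammarSource) {
  const annotated = /^([ \t]*)(\S+)[ \t]*%[ \t]*"([^"]*)"[ \t]*$/;
  return grammarSource
    .split("\n")
    .map((line) =>
      line.replace(
        annotated,
        (_, indent, item, message) =>
          indent + "(" + item + " / '' { report(\"" + message + "\"); })"
      )
    )
    .join("\n");
}

// The `paren` rule written with the proposed annotation.
const proposed = [
  "paren",
  "  = '('",
  "    expr % \"expected expression after `(`\"",
  "    ')' % \"missing `)`\"",
  "  ;",
].join("\n");

// Prints the rule with each annotation expanded to `(item / '' { report("..."); })`.
console.log(desugarRecovery(proposed));
```

The expanded rule still needs a `report` function in the grammar's initializer, like the one in the first example of this thread.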
As a further development, a special operator can be introduced for inferring error message from a labeled expression:
expr "Expression" = paren / ident;
ident = $[a-z]i+;
paren
  = '('
    expr %! // produces "expected `Expression`" error - name of the referenced rule
    ')' %!  // produces "expected `)`" or maybe specially for literals "missed `)`"
  ;
The rules for automatically generated messages could also be configurable.
Feel free to play with the implementation, but IMO the syntax is not very clear, and it probably won't be as flexible as the custom solution described above.
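One way such configurable rules might look is a table of formatters keyed by the kind of expression that `%!` follows. This is purely a sketch of the idea: nothing like it exists in Peggy, and `defaultMessageRules`, `inferMessage`, and the `{ kind, text }` operand shape are all hypothetical.

```javascript
// Hypothetical default message rules, keyed by the kind of expression `%!` follows.
const defaultMessageRules = {
  // A referenced rule uses its human-readable name, e.g. expr "Expression".
  rule: (displayName) => "expected `" + displayName + "`",
  // A literal gets the special wording suggested for literals.
  literal: (text) => "missed `" + text + "`",
};

// Pick the message for one `%!` operand; `rules` lets a user override the defaults.
function inferMessage(operand, rules = defaultMessageRules) {
  const format = rules[operand.kind];
  if (typeof format !== "function") {
    throw new Error("no message rule for kind: " + operand.kind);
  }
  return format(operand.text);
}

// Overriding only the wording for literals.
const customRules = {
  ...defaultMessageRules,
  literal: (text) => "expected `" + text + "`",
};
```

With the defaults, `inferMessage({ kind: "rule", text: "Expression" })` gives "expected `Expression`" and `inferMessage({ kind: "literal", text: ")" })` gives "missed `)`"; passing `customRules` changes the latter to "expected `)`".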
How well would this work with just a report function, so we don't have to add more syntax?
It already works very well without any additions (and IMO a report function is also not needed if we don't want to implement the syntax additions).