participle icon indicating copy to clipboard operation
participle copied to clipboard

Proposal: Error tolerant parsing

Open alecthomas opened this issue 3 years ago • 3 comments

Participle currently supports returning a partial AST when parsing fails, but only up to the point of failure. This issue proposes that Participle reject failed tokens but continue to parse, if possible.

(transcribed from Slack)

For participle to produce a full AST even when errored, my initial thought is that it would need to combinatorially match tokens to all branches of the parse tree, back to the root

eg. (hypothesising here) given this kind of input

a = "foo" bar
c = 10

and this grammar

type Assignments struct {
  Assignment []*Assignment `@@*`
}
type Assignment struct {
  Ident string `@Ident "="`
  Value *Value `@@`
}
type Value struct {
  String *string  `  @String`
  Number *float64 `| @Number`
}

Once it hits the tokens [bar, c, =, 10], bar would correctly parse into Assignment.Ident but Value would fail, so the parser would perhaps reset back to the start of Assignment, but reject the token bar as an error.

There'd have to be some mechanism for capturing the failed branches. Probably out of band somehow, or possibly the partially valid sub-tree could be inserted with a marker indicating it's an error. Similar to how Pos lexer.Position is special cased there could be something like ParseFailure lexer.Position that is populated if that branch has errors.

It would have to be opt-in though, as that kind of backtracking could be quite expensive.

I don't like that ParseFailure. What might be better is RejectedTokens []lexer.Token. It might not be ideal to require that be part of the AST 🤔.

So the AST in this case might end up something like:

Assignments{
  RejectedTokens []lexer.Token{...},
  Assignment: []*Assignment{
    Assignment{ /* a = "foo" */ },
    Assignment{ /* bar */, RejectedTokens: ...},
    Assignment{ /* c = 10 */ },
  },
}

The RejectedTokens could be populated all the way back to the root, potentially.

alecthomas avatar Aug 27 '20 01:08 alecthomas

As per further conversation in Slack, it is probably a better idea not to preserve rejected nodes as multiple rejections could pollute the AST. Unclear until implementation.

alecthomas avatar Aug 27 '20 01:08 alecthomas

A more complex example:

AST

type Decls struct {
  Funcs []*Func `@@*`
}

type Func struct {
  Name string `"func" @Ident "(" ")" "{"`
  Body []*Stmt `@@* "}"`
}

type Stmt struct {
  Asgn *Asgn `  @@`
  If *If  #   `| @@`
}

type Asgn struct {
  Name string `@Ident "="`
  Value int `@Int`
}

type If struct {
  Condition string `@Ident "{"`
  Body []*Stmt `@@* "}"`
}

Source

func foo() {
  a = 10
  if true {
}

func bar() {
}

Parsed

Decls{
  Func{Name: "foo", Body: {
    Stmt{Asgn{Name: "a", Value: 10}},
    Stmt{If{Condition: "true", Body: {}}},
  }}
  Func{Name: "bar", Body: {}}
}

In this case there are no additional tokens. On the contrary, there are missing tokens. Need a way to convey this.

Maybe an Error participle.Error field?

alecthomas avatar Aug 27 '20 13:08 alecthomas

The proposed error mechanism here sounds like what's called "synchronisation", where a rule is marked as a synchronisation point. Whenever a token that doesn't fit is encountered, the parser unwinds until it hits a synchronisation point, at which point it starts discarding tokens until it finds one that fits in the current context, and resumes parsing from there.

IMO ideally the synchronisation points would be user specifiable with the ability to implement an interface for even more control over how the parser synchronises back to a sane state.

pontaoski avatar Oct 20 '21 17:10 pontaoski