pest An alternative to the `Pair` API

I've been working on a project with a fairly large grammar for a while now, and in the process I've had to write and maintain code to convert the output of pest parsing to the actual AST I need in my project. I started out with a bunch of ad-hoc macros to help with dealing with the Pair API, but after a lot of iteration I got to a nice design that I thought could be of use for others. (I did have a look at pest-ast, but I couldn't find basic documentation, and it was too rigid for my use-case)

So I made it into a crate, called pest_consume. It uses procedural macros and works on stable Rust. It is flexible, and makes code that is IMHO easy to write, read, and update when the grammar changes.

I thought I'd advertise it here because I feel it neatly complements pest to make writing parsers a great experience. Also, since there has been discussion around changing the pest API for pest3, I propose this design as something that has worked for me on a large project.

I'd be excited to hear what you guys think about it, if the docs are clear enough and if you think you may want to use it in your project ! If the pest devs like it, maybe this crate could be mentioned in the pest docs ?

Sep 20 '19 18:09 Nadrieril

Hi,

I was just beginning to rewrite the parsing loop of my pet project to make it less awful when you announced pest_consume, which looks really sweet. I will give it a try ASAP, and probably ask you a couple questions ;)

Thanks for it, doc is really awesome ! And I feel I’ll have a cleaner code than by using Pair API (no offense, @dragostis ;)

Sep 21 '19 08:09 lwandrebeck

Hi, Seeing #418 , have you done any bench vs Pair api ?

PS1: I have begun to use pair_consume, but I’m stuck for a couple days until 1) I have some spare time to work again on it 2) I can get the beast to compile. PS2: Do you plan to get on pest gitter ? I’d be nice to interact :)

Sep 24 '19 06:09 lwandrebeck

I haven't done any benchmarks, mostly because I have no idea how to do them correctly. I'm not expecting significant differences in performance however since in most cases the produced code is essentially what you would have written by hand with the Pair API

Sep 24 '19 12:09 Nadrieril

This is sorely needed! I don't really get why Rules are all untyped when Pest has all the type information available and is a code generator! As far as I can tell Tree Sitter works in the same way as Pest - it just gives you an untyped DOM-style tree of nodes. But Nom actually gives you the final Rust output. Once you have written your parser and told Nom to parse it there's no extra "oh, you still have to parse it yourself though" step.

Consider the CSV example:

field = { (ASCII_DIGIT | "." | "-")+ }
record = { field ~ ("," ~ field)* }
file = { SOI ~ (record ~ ("\r\n" | "\n"))* ~ EOI }

Pest gives you something like this.

Rule ("file")
  Rule ("record")
    Rule ("field")
    Rule ("field")
  Rule ("record")
    Rule ("field")
    Rule ("field")
...

The rule variants are actually native rust enums but in practice it makes zero difference to my coding (unfortunately code completion doesn't work) because they are all the same!

enum Rule {
  field,
  file,
  record,
}

impl Rule {
   fn into_inner(...) -> Rule {}
}

Why can't it do this instead?

struct Field {
  span: Span,
}
struct Record {
   span: Span,
   field_0: Field,
   field_1: Vec<Field>,
}
struct File {
   span: Span,
   record: Vec<Record>,
}

Another example, INI:

char = { ASCII_ALPHANUMERIC | "." | "_" | "/" }
name = { char+ }
value = { char* }
section = { "[" ~ name ~ "]" }
property = { name ~ "=" ~ value }
file = {
    SOI ~
    ((section | property)? ~ NEWLINE)* ~
    EOI
}

Would become

struct Char {
  span: Span,
}
struct Name {
  span: Span,
  chars: Vec<Char>,
}
struct Value {
  span: Span,
  chars: Vec<Char>,
}
struct Section {
  span: Span,
  name: Name,
}
struct Property {
  span: Span,
  name: Name,
  value: Value,
}
struct File {
  span: Span,
  section_or_property: Vec<SectionOrProperty>,
}
enum SectionOrProperty {
  Section(Section),
  Property(Property),
}

I suppose you may want to avoid creating a Vec<Char> somehow. Perhaps it could be done lazily, or is there some way of making a rule "silent"? In Tree Sitter you prepend _ to the name.

Nov 10 '19 09:11 Timmmm

It's not only the Vec<char>, you're building up Vecs and nested structs everywhere in your solution. This is probably fine if you intend to build up an AST-like node tree anyway, but for other parsing jobs any such allocation is unneeded and wasteful.

I'm not saying that Pair is the best possible API for this kind of parsing, but this isn't it.

Nov 10 '19 10:11 birkenfeld

This is probably fine if you intend to build up an AST-like node tree anyway

Exactly! I think this should be optional because sometimes you want extreme speed and you don't to store a full AST. But in the vast majority of cases you do. In the examples in the book, the INI, JSON and J language all do this. Only the CSV parser avoids it.

Nov 10 '19 10:11 Timmmm

What you propose could use a custom "container" type like Pairs, but typed. It would construct the children on-demand when you iterate it instead of pre-allocating it. Then I think the performance and memory overhead should be comparable to the Pair API

Nov 10 '19 11:11 Nadrieril

Sounds good! By the way I found someone has articulated this issue much better than me.

Nov 11 '19 09:11 Timmmm

pest-ast is an experimental library to add a typed conversion on top of pest proper. I'll also note that tree libraries like Rowan are pursuing an untyped tree with a typed API provided over it.

Ultimately, pestv3 is planned to provide a listener API as well as integrate something like pest-ast as a first class citizen to produce typed trees from that listener API. Unfortunately, pestv3 progress is stalled as dragostis hasn't had enough time to work on it recently.

Nov 11 '19 13:11 CAD97