An alternative to the `Pair` API
I've been working on a project with a fairly large grammar for a while now, and in the process I've had to write and maintain code to convert the output of pest parsing to the actual AST I need in my project. I started out with a bunch of ad-hoc macros to help with dealing with the Pair API, but after a lot of iteration I got to a nice design that I thought could be of use for others.
(I did have a look at pest-ast, but I couldn't find basic documentation, and it was too rigid for my use-case)
So I made it into a crate, called pest_consume. It uses procedural macros and works on stable Rust. It is flexible, and makes code that is IMHO easy to write, read, and update when the grammar changes.
I thought I'd advertise it here because I feel it neatly complements pest to make writing parsers a great experience. Also, since there has been discussion around changing the pest API for pest3, I propose this design as something that has worked for me on a large project.
I'd be excited to hear what you guys think about it, if the docs are clear enough and if you think you may want to use it in your project ! If the pest devs like it, maybe this crate could be mentioned in the pest docs ?
Hi,
I was just beginning to rewrite the parsing loop of my pet project to make it less awful when you announced pest_consume, which looks really sweet. I will give it a try ASAP, and probably ask you a couple questions ;)
Thanks for it, doc is really awesome ! And I feel I’ll have a cleaner code than by using Pair API (no offense, @dragostis ;)
Hi, Seeing #418 , have you done any bench vs Pair api ?
PS1: I have begun to use pair_consume, but I’m stuck for a couple days until 1) I have some spare time to work again on it 2) I can get the beast to compile. PS2: Do you plan to get on pest gitter ? I’d be nice to interact :)
I haven't done any benchmarks, mostly because I have no idea how to do them correctly. I'm not expecting significant differences in performance however since in most cases the produced code is essentially what you would have written by hand with the Pair API
This is sorely needed! I don't really get why Rules are all untyped when Pest has all the type information available and is a code generator! As far as I can tell Tree Sitter works in the same way as Pest - it just gives you an untyped DOM-style tree of nodes. But Nom actually gives you the final Rust output. Once you have written your parser and told Nom to parse it there's no extra "oh, you still have to parse it yourself though" step.
Consider the CSV example:
field = { (ASCII_DIGIT | "." | "-")+ }
record = { field ~ ("," ~ field)* }
file = { SOI ~ (record ~ ("\r\n" | "\n"))* ~ EOI }
Pest gives you something like this.
Rule ("file")
Rule ("record")
Rule ("field")
Rule ("field")
Rule ("record")
Rule ("field")
Rule ("field")
...
The rule variants are actually native rust enums but in practice it makes zero difference to my coding (unfortunately code completion doesn't work) because they are all the same!
enum Rule {
field,
file,
record,
}
impl Rule {
fn into_inner(...) -> Rule {}
}
Why can't it do this instead?
struct Field {
span: Span,
}
struct Record {
span: Span,
field_0: Field,
field_1: Vec<Field>,
}
struct File {
span: Span,
record: Vec<Record>,
}
Another example, INI:
char = { ASCII_ALPHANUMERIC | "." | "_" | "/" }
name = { char+ }
value = { char* }
section = { "[" ~ name ~ "]" }
property = { name ~ "=" ~ value }
file = {
SOI ~
((section | property)? ~ NEWLINE)* ~
EOI
}
Would become
struct Char {
span: Span,
}
struct Name {
span: Span,
chars: Vec<Char>,
}
struct Value {
span: Span,
chars: Vec<Char>,
}
struct Section {
span: Span,
name: Name,
}
struct Property {
span: Span,
name: Name,
value: Value,
}
struct File {
span: Span,
section_or_property: Vec<SectionOrProperty>,
}
enum SectionOrProperty {
Section(Section),
Property(Property),
}
I suppose you may want to avoid creating a Vec<Char> somehow. Perhaps it could be done lazily, or is there some way of making a rule "silent"? In Tree Sitter you prepend _ to the name.
It's not only the Vec<char>, you're building up Vecs and nested structs everywhere in your solution. This is probably fine if you intend to build up an AST-like node tree anyway, but for other parsing jobs any such allocation is unneeded and wasteful.
I'm not saying that Pair is the best possible API for this kind of parsing, but this isn't it.
This is probably fine if you intend to build up an AST-like node tree anyway
Exactly! I think this should be optional because sometimes you want extreme speed and you don't to store a full AST. But in the vast majority of cases you do. In the examples in the book, the INI, JSON and J language all do this. Only the CSV parser avoids it.
What you propose could use a custom "container" type like Pairs, but typed. It would construct the children on-demand when you iterate it instead of pre-allocating it. Then I think the performance and memory overhead should be comparable to the Pair API
Sounds good! By the way I found someone has articulated this issue much better than me.
pest-ast is an experimental library to add a typed conversion on top of pest proper. I'll also note that tree libraries like Rowan are pursuing an untyped tree with a typed API provided over it.
Ultimately, pestv3 is planned to provide a listener API as well as integrate something like pest-ast as a first class citizen to produce typed trees from that listener API. Unfortunately, pestv3 progress is stalled as dragostis hasn't had enough time to work on it recently.