pest
pest copied to clipboard
[RFC] "Template"/Macro/Generic rules
Motivation
One of the advertised features of LALRPOP, macros/templates/generics are a useful tool for factoring out common parts of your grammar. The common example is CommaSeparated<production>
to represent production ~ ("," ~ production)* ~ ","?
. (This can also be written (production ~ ",")* ~ production?
, but I prefer the former formulation.)
In this RFC I lay out how a design for generic rules might look in pest, and attempt to make a case for their implementation.
A proposal for standard casing
In the 2.0 version of pest, the standard casing sees builtin rules in SHOUT_CASE
and user rules recommended to be in snake_case
. This RFC proposes that generic rules could be TitleCase
by convention, along with their arguments, to distinguish them from normal rules.
Guide-Level Explanation
(Shamelessly adapted from the LALRPOP book, which is licensed MIT/Apache as is LALRPOP itself)
When writing grammars we encounter repetitive constructs that might normally be copy-and-pasted. A common example is something like a "comma-separated list". If we want to parse a comma-separated list of expressions, it might look something like:
expressions = { expression ~ ("," ~ expression)* ~ ","? }
But what happens if later we want a comma-separated list of term
s, or anything else? For this, pest offers generic rules. By using a generic rule, we can factor out this common functionality into one place.
expressions = { CommaSeparated<expression> }
terms = { CommaSeparated<term> }
CommaSeparated<Rule> = _{ Rule ~ ("," ~ Rule)* ~ ","? }
Because CommaSeparated
is marked as a silent rule with a _
, this means this is functionally equivalent to inlining its structure into both expressions
and terms
. If a generic rule is not silenced, it will be included in the output structure just like any other rule.
Implementation-Level Explanation
There are two ways to handle generic rules. In the first, we treat it as a template, and generate multiple parsing functions for each instantiation. In the second, we pass along the generics to the Rust code.
I will explain via walking through in pseudocode the following example (note that no rules are silent, unlike above):
expressions = { CommaSeparated<expression> }
terms = { CommaSeparated<term> }
CommaSeparated<Rule> = { Rule ~ ("," ~ Rule)* ~ ","? }
Template desugaring
For each unique rule that is passed into a generic rule, desugar to a new rule instantiated with the concrete rule(s) passed to the generic rule.
expressions = { CommaSeparated__expression }
terms = { CommaSeparated__term }
CommaSeparated__expression = { expression ~ ("," ~ expression)* ~ ","? }
CommaSeparated__term = { term ~ ("," ~ term)* ~ ","? }
The generated rules do not correspond to unique Rule
enum variants in the output, however; all generated rules from the same generic rule map to the same Rule
enum variant.
Generic implementation
In addition to the parser state, the generated function for parsing this rule takes an argument representing what rule is passed as its generic argument. It then calls said function for any time the generic argument is present in the definition.
fn CommaSeparated(
state: Box<::pest::ParserState<Rule>>,
Rule: impl Fn(Box<::pest::ParserState<Rule>>) -> ::pest::ParseResult<Box<::pest::ParserState<Rule>>>,
) -> ::pest::ParseResult<Box<::pest::ParserState<Rule>>> {
Rule(state)?
.many(0.., |state| Rule(state.literal(",")?))?
.optional(|state| state.literal(","))
}
Grammar changes
The terminal
rule is changed to accommodate generic rules:
-terminal = _{ _push | identifier | string | insensitive_string | range }
+terminal = _{ _push | (identifier ~ ("<" ~ CommaSeparated<terminal> ~ ">")?) | string | insensitive_string | range }
In the future, we may wish to relax this such that a generic rule can take a term
or even expression
instead. Conversely, we may wish to only accept one terminal
instead of a list to begin with.
Prior Art
- LALRPOP, as linked above
Unresolved Questions
- Do we want to pass along the genericness to the generated Rust code, or "monomorphize" the rules ourselves?
- How generic should the generics be allowed to be? Do we want to support the more generic
Separated<Term, By> = { Term ~ (By ~ Term)* ~ By? }
? Do we need to support it? - Does paramaterizing generics over
term
orexpression
instead ofterminal
give any more convenience to the user? Any generic rule can be expressed solely by taking a terminal by just defining a silent terminal to be the desired more complicated expression.
CCing some of the linked project authors whose grammar looked at a glance that they might benefit from this (and are currently using the derive grammar):
@sunng87 (handlebars-rust), @jturner314 (py_literal), @wahn (rs_pbrt), @Keats (tera),
I wrote this RFC because I'm currently wishing I had it in my personal project. If we want to write a large language grammar using pest, this re-usability factor is probably much more useful than even #197.
I like the idea and I think it could potentially be something to be included in 2.0 or 2.1. However, I would prefer a slightly different approach. How about every rule can take arguments like normal functions? rule
would thus be equivalent to rule()
. Everything else would stay the same.
The implementation would be a bit more demanding, since the AST would need to be changed to some degree. Monomorphization would also be needed in order to have good performance, probably implemented as an optimization step in meta
.
@CAD97, would you be willing to take a jab at it once we put everything in order?
Yep, I can work on an MVP implementation. I like the idea of making every rule a function that takes rule arguments.
If we translate this to generics at the Rust level, Rust will take care of the monomorphization pass for us.
This is definitely a 2.1 thing rather than a 2.0 blocker, though. Either formulation of <>
or ()
is a clean extension to the existing 2.0 syntax.
There were little comma-separated syntax in handlebars, but I think this can be a good addition to pest. Also I'm a fan of a more generic Separated<Term, By, EscapedBy>
implementation.
Also how about a Quoted<Term, StartBy, EndBy, EscapedBy>
?
Sounds like a good idea to me!
Also I'm a fan of a more generic Separated<Term, By, EscapedBy> implementation.
+1 on that