chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

Better documentaton of function parsers, pretty please

Open jamesyoungman opened this issue 1 year ago • 15 comments

The examples all (AFAICS) use let to define parsers. But there are use-cases where this isn't convenient:

  1. Unit testing - for more complex languages, it's helpful to expose parsers for parts of the language so that they can be unit-tested independently.
  2. Beginners with chumsky - I'm trying to convert from nom to chumsky and the most obvious way to proceed is bottom-up; to implement chumsky parsers for the simplest parts of the grammar first, and then build upward.

Neither of these things seems very easy when constructing a correct impl Parser<...> declaration for the parser function's return type is entirely unobvious. I mean, the parser itself seems straighforward (obviously I picked a trivial parser for the sake of the example):

/// Accept either Unicode Dot Operator U+22C5 ("⋅") or Unicode Middle Dot U+00B7 ("·").
fn maybe_normal_dot<'srcbody, I: Input<'srcbody>, E>() -> impl Parser<'srcbody, I, bool, E> {
    one_of("\u{22C5}\u{00B7}").to(true).or_not()
}

but that parser can't be used anywhere (with chumsky 1.0.0-alpha.3) because the declared return type is missing a lot of trait bounds. Some guidance from the examples or the documentation would be very welcome.

jamesyoungman avatar Apr 22 '23 08:04 jamesyoungman

Do you know what sort of documentation would help alleviate this issue, out of interest?

zesterer avatar Apr 24 '23 08:04 zesterer

I think the first step would be to better document the parameters of the Parser trait. The documentation of the Input token type and the "extra" trait parameter seems to be very minimal.

Beyond that, an example parser in which many of the non-terminals are parsed by functions which have unit-tests.

If you want to be more ambitious, a tutorial showing the incremental migration of a non-trivial parser to Chumsky from other frameworks (though my interest is primarily nom).

jamesyoungman avatar Apr 24 '23 10:04 jamesyoungman

I just migrated from nom and I've been using the State type / with_state functions to insert AST nodes into an arena, returning a key instead of the node itself. I would be open to contributing some documentation relating to this use case if that would be useful.

willothy avatar May 07 '23 20:05 willothy

I would be open to contributing some documentation relating to this use case if that would be useful.

That would be very much appreciated!

zesterer avatar May 08 '23 18:05 zesterer

I've been using a function-based approach with Chumsky 0.9, exactly because I wanted to make sure I could unit test. But this leads to some issues when recursion is involved, I think (I will explore more tomorrow). And I now understand that the function signature will become more difficult in the Chumsky 1.0 world.

So let me rephrase the question: how to unit test sub-parsers? If we have an answer to that, we know what documentation to provide.

Or is the whole question somehow misguided? How would one construct parsers in a bottom-up manner otherwise? If there are alternative strategies that avoid bottom-up construction with unit testing, I'd love to understand that too, and an explanation of that in the docs would be appropriate.

I read the comment by @willothy but while storing AST nodes into an arena seems cool, I'm unclear how it would help with unit testing parts of a larger parser.

faassen avatar Jul 05 '23 17:07 faassen

There is an example of a function-based parser with recursion at https://github.com/TX-2/TX-2-simulator/blob/eb32020298309e9a631c5ef1fd4a89c52bf8f4f8/assembler/src/asmlib/parser.rs#L77

Message ID: @.***>

jamesyoungman avatar Jul 05 '23 18:07 jamesyoungman

What james linked is a good example - I commonly use a pattern kind of like the following (in 1.0):

type Input<'a> = ...;
type Extra<'a> = ...;

macro_rules! Parser {
    ($lt:lifetime, $ty:ty) => { impl Parser<$lt, Input<$lt>, Output, Extra<$lt>> + Clone + $lt }
}

struct SubExpr {}
struct Expr {}

impl Expr {
    fn parser<'a>() -> Parser!['a, Self] {
        recursive(|expr| SubExpr::parser(expr).whatever(...))
    }
}

impl SubExpr {
    fn parser<'a>(expr: Parser!['a, Expr]) -> Parser!['a, Self] {
        whatever().then(expr)
    }
}

With this pattern, you can easily test sub-parsers by just invoking Expr::parser().parse(), or even handle testing SubExpr::parser(...) in limited ways by feeding it some stub that just emits an expression, like SubExpr::parser(empty().to(Expr::Error)) or such.

I've been using this same pattern since before 0.9, and adapted it to 1.0 when it was developed, and I've found it works amazingly in both versions for me.

CraftSpider avatar Jul 05 '23 20:07 CraftSpider

Thank you both for these answers, I'll take a look!

After I wrote the above, I realized I have another use case for sub-parsers in my code besides unit testing. I use a small subset of the language I'm implementing in a Rust macro to help define types for functions I plug into its library.

Right now I'm using Pest in my project, and it served me well for the last few months, but the code you need to write to turn the Pest parsed result into an AST is of the same magnitude as you'd need to do with Chumsky, and it's messier without combinators.

faassen avatar Jul 06 '23 09:07 faassen

I got things to work in Chumsky 0.9.

Chumsky 1.0a however was much more challenging and I want to give a brief experience report for using functions that return parsers (useful for unit testing, reusing a part of the parser), and Chumsky 1.0 changes in general. Hopefully it can help inform documentation, or perhaps APIs. I realize that some of these are not directly actionable, but I thought I'd thrown them in anyway as food for discussion.

  • the types are now very verbose
  • When you now write a function that returns the wrong type of value, you don't get a clear error message about doing that wrong. You get an error but it's not clear what is wrong, which can be quite the puzzle. In 0.9 I had clearer error messages.
  • filter() now requires any and the second generic type in any has to be the Extra type you define, otherwise it won't work and the error messages aren't clear at all. This eluded me for a long time.
  • I had a custom error struct that implements the trait, but I didn't get that to work yet. It's possible now that I understand the Extra business a bit better I can figure it out. I'm a bit worried I'll lose things that 'Rich' has to offer with my own error type, but I'm not sure whether I would?
  • getting repeated, then map doesn't work anymore. map gets the unit type as an argument. That was very confusing. I have to use collect after repeated instead. Is there a way to avoid that? Couldn't repeated collect automatically? I need to study the docs to see what I missed. There's a common pattern where I want to parse something, then parse a repeated thing, and collect all of them in a vec - maybe there's a better way.
  • fold changed its signature. I think I figured out the new signature but am still unclear as to why the change.
  • SimpleSpan is not hashable, which tripped up my AST which was. I am probably able to replace the span but I don't know how to do that yet
  • In general I'm still lost how the types work; input output and extra types, spans, tokens, ValueInput, StringInput. What do they do? What does it gain me? Is there a way to simplify this for a common case?

I like that there's a structured way to put in state now through Extra; it reduces parameter passing for that use case.

I like the idea that 1.0 offers performance improvements and increased flexibility, but at least for the use case of reusable parser functions, Chumsky 0.9 was a lot easier to handle than Chumsky 1.0a.

Anyway, I did get things to work with Chumsky 1.0a too, but since putting the pieces together was more complicated I thought I'd let you all know. Thanks for Chumsky!

faassen avatar Jul 06 '23 17:07 faassen

My experience (with 1.0.0-alpha.4) closely mirrors that described by Martijn.

Message ID: @.***>

jamesyoungman avatar Jul 06 '23 18:07 jamesyoungman

I must say that my comments are mostly from the perspective of someone upgrading; I see the API docs go into quite a bit of the traits.

That said, repeated/collect (containers?) and all both could do with some more attention.

faassen avatar Jul 06 '23 19:07 faassen

I read the comment by @willothy but while storing AST nodes into an arena seems cool, I'm unclear how it would help with unit testing parts of a larger parser.

That comment wasn't intended to be related to unit testing, just documentation :)

willothy avatar Jul 06 '23 23:07 willothy

Chumsky really wants you define parsers inside of a giant function to get the maximum benefit of the type checker. Using multiple functions creates quite a bit of complexity in signatures and goes against the flow of Chumsky.

So I figured out another way to be able to use (and test) sub-parsers: instead of returning a single Parser back from the single giant parser function, return multiple parsers.

This can be accomplish by a liberal helper of boxing (which is also necessary to get tolerable compile times anyway and according to the docs has virtually no run-time impact):

Here's a sketch of the struct you return (in 1.0a4):

struct ParserOutput<'a, I> where I: ... your input goes here... {
   sub_parser_a: Boxed<'a, 'a, I, TypeA>,
   sub_parser_b: Boxed<'a, 'a, I, TypeB>
}

Then in your test (or wherever), you can extract parsers like that:

let sub_parser_a = parser().sub_parser_a;

To ensure parsers are boxed call .boxed() on them. Remember to do this for the output of the recursive() function as well if you want to write recursive parsers.

There are still cases where you want to factor out functions when you want to create reusable code that's used to construct a lot of parsers: code that takes parsers as arguments and constructs a new parser. I tried this with closures but the type inference wasn't up to snuff, so there was little benefit. Boxed parsers are again helpful there in simplifying things; you're likely going to need this anyway to get Clone support.

faassen avatar Jul 10 '23 13:07 faassen

Here's another report of further progress, in the hope it informs anyone after me reading this (or can serve as eventual documentation). I went with the approach described in the previous comment and ended up with an absolutely huge function (it's a big grammar). It worked well, but eventually vscode's clippy and/or rust analyzer really struggled, and took a long time. I boxed everything I could, but it wasn't enough, which made development more frustrating.

So I went back and split my mega function into multiple functions. There are big sub-sections of the grammar defined in each function, and I'm still returning multiple boxed parsers from each function, but putting some boundaries in the code appears to have helped and while the vscode 'clippy' phase still takes a while, it's now tolerable. And it's nicer to have organized separate things into separate functions, too.

faassen avatar Jul 13 '23 12:07 faassen

Quick note: #481 would make SimpleSpan hashable, since I figured that made sense to add while I was working on it.

CraftSpider avatar Jul 15 '23 00:07 CraftSpider