chumsky icon indicating copy to clipboard operation
chumsky copied to clipboard

Implement Pratt parsing for binary infix operators.

Open alvra opened this issue 2 years ago • 3 comments

I was having trouble parsing binary operators with deeply nested combinators to handle precedence. The Rust compiler time and memory usage grow exponentially until it runs out of memory for the ~10 levels I need. The types generated also start to fill pages at this point. You can reproduce this by adding a few extra operator to the nano_rust example.

This PR implements a Pratt parser specifically designed for binary infix operators that solves these problems. It also provides a much nicer interface IMO (example included).

I'm not sure if this should be included in chumsky itself—it is perhaps too much tied to a single use-case—but it's currently (almost?) impossible to implement such a thing in a separate crate as it requires access to chumsky privates.

The implementation is based on this article: https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html

alvra avatar May 17 '22 17:05 alvra

Thank you for your kind words!

I'm glad you like the documentation. I wanted to try and meet the bar you set :)

You're absolutely right about the API mismatch. Your suggestion seems like the most combinator-esque option. To support specifying associativity it could allow a second form for users that need it. Also, the operator parser can return the build function directly.

let expr = atom.binary_infix_operator((
    (0, Associativity::Left, just('+').to(|l, r| Expr::Add(l, r))),
    (0, Associativity::Left, just('-').to(|l, r| Expr::Sub(l, r))),
    (1, Associativity::Left, just('*').to(|l, r| Expr::Mul(l, r))),
    (1, Associativity::Left, just('/').to(|l, r| Expr::Div(l, r))),
));

This is, however, considerably more complex then the other combinators; a "tuple" trait of a "tuple" trait. While more succinct, I wonder if this loses to much in terms of clarity and debuggability; for example if you mix just('+') and just("&&") I suspect the error message will not be very helpful.

alvra avatar May 18 '22 08:05 alvra

@alvra this is super cool! I think it would be great if this could also support unary prefix and suffix operators in the future. If there were functions for defining these different types of operators (similar to the functions in primitives) these could then be passed to atom.pratt(), e.g.

let expr = atom.pratt((
    prefix(0, just('-'), |x| Expr::Neg(x)),
    suffix(0, just('!'), |x| Expr::Fact(x)),
    infix(1, just('+'), |l, r| Expr::Add(l, r)),
    infix(1, just('-'), |l, r| Expr::Sub(l, r)),
    infix(2, just('*'), |l, r| Expr::Mul(l, r)),
    infix(2, just('/'), |l, r| Expr::Div(l, r)),
));

If a person want to be able to parse expression containing parentheses, would they be able to define a recursive parser that composes binary_infix_operator with delimited? If so, it might be good to include an example or a test case that shows this in action.

kevinbarabash avatar Aug 04 '22 01:08 kevinbarabash

I really believe this should make it into master for the next release as the speed benefits can be absolutely absurd, and I do not mean that lightly :D I actually have real life example, so strap in!

I am currently implementing a Lua 5.4 Parser, and was taken aback when this piece of Lua took 17s to parse.

fs.open("license.txt", 'r', "0644", function (err, fd)
  p("on_open", {err=err,fd=fd})
  if (err) then return end
  fs.read(fd, 0, 4096, function (err, chunk)
    p("on_read", {err=err,chunk=chunk,length=#chunk})
    if (err) then return end
    fs.close(fd, function (err)
      p("on_close", {err=err})
      if (err) then return end
    end)
  end)
end)

Since humans make terrible profilers, I grabbed cargo flamegraph and went to work. After running a flamegraph I found that a large majority of the time was spent in Foldl::parse_inner_silent. I only really use Parser::foldl for parsing binary expressions, so the problem portion was easily identifier. The unfortunate part, is that I couldn't (AFAIK) do anything to reduce the backtracking that has to occur between the individual binary expressions; I had already implemented the parsers as suggested by the tutorial, and was running into a wall.

Thankfully, I remembered that @alvra had written this pull request, and decided to see what performance gains could be won. So, I patched my local version of chumsky, and parsed the above lines of Lua. This time though, it only took 560ms to parse, so just your casual 30x speedup. Although the API will differ from the pull request, I believe this to be an invaluable addition to the crate.

For those that are curious to see exactly how the two implementations differ, or wish to test it out themselves, below is a link to both the implementations, which will lead you to the repository:

If there are any questions as to my findings, please do let me know! I am happy to answer!

Oh, and here are the two flamegraphs:

without_pratt with_pratt

Zij-IT avatar Aug 13 '22 21:08 Zij-IT

@zesterer @alvra is there any progress on getting this into master?

JohnDowson avatar Dec 08 '22 07:12 JohnDowson

Not yet, unfortunately. I'm really wanting for time up until Christmas. I think I'll have time to attend to this in January. If somebody else is interested in guiding this through review and making considerations about the API (which I feel still needs a bit of work to avoid being totally at odds with the API of the rest of the crate), I'm happy to give assistance.

zesterer avatar Dec 08 '22 09:12 zesterer

Closable with #464?

Zij-IT avatar Jul 09 '23 08:07 Zij-IT