nom icon indicating copy to clipboard operation
nom copied to clipboard

Precedence parsing

Open cenodis opened this issue 3 years ago • 12 comments

This PR introduces a new module with parsers that can handle operator precedence in expressions. It was written with the intent to allow quick and easy translation of a usual operator precedence table into a nom parser. Individual operators and operands are parsed via nom parsers allowing for easy integration with existing parsers.

Example usage (very simple integer calculator):

fn parser(i: &str) -> IResult<&str, i64> {
  precedence(
    unary_op(1, tag("-")), //prefix operators
    fail, //postfix operators
    alt(( //binary operators
      binary_op(2, Assoc::Left, tag("*")),
      binary_op(2, Assoc::Left, tag("/")),
      binary_op(3, Assoc::Left, tag("+")),
      binary_op(3, Assoc::Left, tag("-")),
    )),
    alt(( //operands
      map_res(digit1, |s: &str| s.parse::<i64>()),
      delimited(tag("("), parser, tag(")")), //subexpression handled via recursion
    )),
    |op: Operation<&str, &str, &str, i64>| { //evaluating the expression step by step
      use nom::precedence::Operation::*;
      match op {
        Prefix("-", o) => Ok(-o),
        Binary(lhs, "*", rhs) => Ok(lhs * rhs),
        Binary(lhs, "/", rhs) => Ok(lhs / rhs),
        Binary(lhs, "+", rhs) => Ok(lhs + rhs),
        Binary(lhs, "-", rhs) => Ok(lhs - rhs),
        _ => Err("Invalid combination"),
      }
    }
  )(i)
}

TODOs

  • [x] Better documentation. All elements of this module are documented and I cant think of anything more to add. ~~I have done my best to write good documentation but some feedback would be appreciated. I still feel like some parts could be improved but I have no concrete ideas right now.~~
  • [x] Negative tests. Tests for parser failures now exist. ~~Currently the tests for precedence only check for positive results (i.e. successful parses). I would like to get some cases with invalid input as well. I have looked at how the existing tests handle this but the current error handling in nom escapes me. Help with this would be nice.~~
  • [x] Improving API. A "fail" parser now exists in nom and can be used to specify "no operators of this kind". I see no other significant problems with the API. ~~The current API I have come up with feels solid for the most part but I still think theres some room for improvement. Especially when handling languages that may not have certain classes of operators (for example a language with no postfix operators). This necessitates the use of a parser that always fails, but a parser with that functionality does not exist in nom so workarounds like verify(tag(""), |_: &str| false) are needed.~~
  • [x] Recipes/Examples. The tests now have an example containing AST generation, function calls and ternary operators using this parser. ~~I would like to add more examples into the recipes or example section. Especially for more involved expression containing things like function calls and ternary operators.~~

Open Questions

  • [x] How should this parser handle "ambiguous precedences"? ~~(see this comment for more details about this)~~ Resolution: The current behaviour is sufficient. See here for reasoning.

cenodis avatar Aug 15 '21 15:08 cenodis

More details about the "ambiguous precedences".

Edit: This section was rewritten entirely to hopefully make the problem a bit more clear.

Assume the following precedences:

Operator Position Precedence Associativity
++ (Increment) Postfix 3 N/A
- (Negation) Prefix 2 N/A
** (Power) Binary 1 Right

Based on the above the following expressions would get parsed like this:

  • -a**a => -(a**a) (Power has the highest precedence)
  • a++**a => (a++)**a (Postfix must be evaluated first to preserve its position)

Based on the above one might expect the expression: -a++**a to be parsed like so: -((a++)**a) (The unary minus gets handled as in -a**a and the increment gets handled as in a++**a)

However the expression is actually ambigious and can also be parsed as: ((-a)++)**a

This parser evaluates the expression as the latter because of its behaviour to evaluate operators as soon as possible. The parser moves left to right meaning once it reaches the postfix operator it sees the following input: -a++ Based on the precedence table above the Negation needs to be evaluated before the Increment. So it evaluates the negation now, turning -a into its result (which we will refer to as b). So the input now looks like this: b++ Then it reads the next operator (and because its a binary operator we can also include the next operand) resulting in this input: b++**a. Which gets evaluated the same way as the a++**a above, turning it into this: (b++)**a. If we now replace b with its original input we see how the parser arrives at its output: ((-a)++)**a

Im not really sure how to solve this, or if it should be solved at all. Technically both are valid parses, it just might not be what one would expect at first glance. To my knowledge there exists no standard on how such an expression should be evaluated with the above precedences so any "expected parse" would be more gut feeling than objective truth (Said gut feeling might also be the consequence of the fact that the postfix operators usually have the highest precedence).

I could just declare this behaviour intentional and document it which would probably be the simplest solution (and the parser is consistent in its behaviour with such precedences so once you are aware of it there are no real suprises anymore).

Trying to implement the above precedence via the naive method outlined in the arithmetic example just causes the parser to fail with the above input. Which is also not helpful because -a++**a is a valid expression if you are just looking at the operators available.

The only common postfix operators in standard math I know are function calls which have a higher precedence than the power or negation operator. Languages which have the power operator and give it precedence over the unary minus (like Python) also solve the problem this way (by giving the postfix operators the highest precedence of all). So I dont expect this to be a significant problem in practice but its at least theoretically possible to write operator precedences like this.

@Geal I have seen you link a haskell parser that seems somewhat similar to the parser in this PR (at least API wise). Unfortunately I know nothing about haskell. On the off chance that you might be more familiar with it: Do you know how that parser would handle cases like this?

cenodis avatar Aug 15 '21 15:08 cenodis

Due to the lack of contrary opinions I have decided to label the current behaviour with ambiguous expressions as intentional.

This was done with the following reasons:

  • Nom makes no guarantees regarding the handling of ambiguous grammars.
  • Other parser exhibit similar behaviour (alt will just pick whichever parser was specified first and swallow any ambiguity).
  • The parser is consistent with its handling of ambiguous grammars. So there are no surprises once the user is aware of it.
  • I see little practical value in adding additional special handling for ambiguous grammars. Such handling would be extremely specific to the language being implemented and at that point it would most likely be easier to just manually write an expression parser that is purpose-built for that language.

cenodis avatar Aug 19 '21 13:08 cenodis

@Stargateur

Why put this in nom ?

There are no guidelines on what parser specifically belong in nom or not (or if it exists I havent found it in the contribution guide). To quote Geal:

more support for text parsing [...] handling precedence in expressions (something like https://hackage.haskell.org/package/parsec-3.1.14.0/docs/Text-Parsec-Expr.html would be nice) [...] This could live in nom, or in a separate crate with a faster release cycle [...]

To me this seperate crate still sounds like an open decision. And since I dont know of any such crate existing as of yet I figured I would take my chances and get it into standard nom.

specially you force complete input

Could you tell me where Im forcing complete input? If one of the operand or operator parser returns Needed it should just bubble to the caller. As far as I can see this parser should be compatible with both complete and streaming parsers and which ones are used depends solely on what type of parsers are passed to precedence. If you could give me specific lines via a review I will gladly try and fix anything that causes an incompatability with streaming parsers.

cenodis avatar Aug 19 '21 20:08 cenodis

Could you tell me where Im forcing complete input? If one of the operand or operator parser returns Needed it should just bubble to the caller. As far as I can see this parser should be compatible with both complete and streaming parsers and which ones are used depends solely on what type of parsers are passed to precedence. If you could give me specific lines via a review I will gladly try and fix anything that causes an incompatability with streaming parsers.

nevermind I miss understand.

Stargateur avatar Aug 19 '21 20:08 Stargateur

That's a lot to pass into a single function at once, especially since some of the arguments correspond to aspects of the precedence parser that not everyone will use. Maybe the API should use the builder pattern instead?

eaglgenes101 avatar Aug 19 '21 21:08 eaglgenes101

@eaglgenes101

That's a lot to pass into a single function at once.

5 parameters in total. 3 of which are always required.

builder pattern

I dont think making a builder pattern for 2 optional parameters would be that great. A builder pattern would also either replicate the current API pretty closely (i.e. a single parser for each category) or I would have to use dynamic dispatch (to maintain a list of parsers for each category). And I would really like to avoid dynamic dispatch if possible. I did some small experiments with builders while developing this and ultimately abandoned it because the builder didnt really help and just added bloat to the module.

Also there is nothing preventing you from extracting the parsers for the individual operand or operator groups into their own functions and then just passing those to the precedence parser. Same with the fold function. I just put it all in one place to give a comprehensive example of how it works.

Edit: Looking over the docs 5 parameters dont seem that out of line. There are plenty of parser that take 3 or 4 parameters, not counting alt and permutation which can (kind of) take up to 21 parameters. There is even an already existing parser with 5 parameters, fold_many_m_n.

cenodis avatar Aug 19 '21 22:08 cenodis

hi! just fyi, I'll review that on saturday or sunday, it looks like a great addition to nom :)

Geal avatar Aug 24 '21 14:08 Geal

@Geal Any update on this?

cenodis avatar Sep 01 '21 07:09 cenodis

This looks very useful, it would certainly clean up some of my parsers. @cenodis would you consider releasing this as a separate crate, if it doesn't look like it's making it into nom proper?

mullr avatar Mar 02 '22 23:03 mullr

Great, a feature that I need is stuck in a PR that's been sitting around for 6 months. @Geal can you please take a look at this?

LoganDark avatar Mar 14 '22 08:03 LoganDark

It's been a weird 6 months for my open source work honestly 😅 But it's time for a new major version and precedence parsing will definitely go there

Geal avatar Mar 14 '22 09:03 Geal

In case anybody else needs it: I turned the contents of this PR into a little crate that works with nom 7: https://github.com/mullr/nom-7-precedence. I'm looking forward to being able to delete it in the future, once nom 8 is a thing!

mullr avatar Apr 19 '22 18:04 mullr