wdl icon indicating copy to clipboard operation
wdl copied to clipboard

Error parsing ternary operator assignment

Open rickymagner opened this issue 1 year ago • 4 comments

Hi folks, I think I found a subtle bug in parsing a particular type of ternary operator assignment. Here are the steps to reproduce:

  1. Make the following file called MinimalParse.wdl:
version 1.0

workflow MinimalParse {
    input {
        Int num = 8
    }

    Int new_num=if(num>5) then 1 else 2  # Add space after `if` to parse with wdl crate
}
  1. Try to parse using the wdl crate. Here is a Rust snippet:
// Inside main
    let src = std::env::args().nth(1).expect("missing src");
    let contents = std::fs::read_to_string(src).expect("could not read file contents");

    // Generate the parse tree from the document source.
    let parse_result = grammar::v1::parse(&contents);
    let parse_tree = match parse_result {
        Ok(_) => {
            let inner = parse_result.unwrap();
            println!("Inner is: {:?}", &inner);
            let tree = inner.into_tree().expect("Cannot make tree");
            tree
        }
        Err(e) => {
            println!("{}", e);
            panic!()
        }
    };
  1. The printed output is:
Inner is: Result { concerns: Some(Concerns(NonEmpty { head: ParseError(Error { message: "The following tokens are required: singular_identifier.", location: Position(Position { line_no: 8, col_no: 32, byte_no: 107 }) }), tail: [] })), tree: None }
  1. Add a space after if in the original WDL, i.e. Int new_num=if (num>5) then 1 else 2 and rerun. It now parses successfully.

I'm not defending the gross-looking syntax where there's no space after if, but womtool validate passes and cromwell does in fact run the original WDL, so this seems to be a parsing bug.

Curiously enough, the bug doesn't seem to care about the other shifty looking lack of spaces. Adding a space before or after the = does nothing to fix this. It also doesn't seem to care if you use a space or not after if in normal blocks outside of a ternary assignment.

rickymagner avatar Apr 23 '24 14:04 rickymagner

@peterhuene would you mind taking a look at this one?

claymcleod avatar May 02 '24 14:05 claymcleod

The root cause of the issue is that the pest grammar defines the if rule like this:

 "if" ~ (WHITESPACE | COMMENT)+ ~ expression ...

This precludes if(expr). If we change the rule to:

 "if" ~ (WHITESPACE | COMMENT)* ~ expression ...

Then if(expr) would parse, but then this would also parse:

Int new_num=iftrue then 1 else 2

The fix will need to be:

"if" ~ (&ANY_TOKEN_THAT_MIGHT_START_AN_EXPRESSION | (WHITESPACE | COMMENT)+) ~ expression ...

Where ANY_TOKEN_THAT_MIGHT_START_AN_EXPRESSION would be the token set:

(
!
+
-
{
[

Parsing a CST with pest can be error-prone as it forces the rules to think about whitespace and comments as part of the rule.

peterhuene avatar May 02 '24 18:05 peterhuene

@peterhuene You might already see this, but reading your fix for the situation I think that would introduce more problems.

As I'm reading this, "if" ~ (ANY_TOKEN_THAT_MIGHT_START_AN_EXPRESSION | (WHITESPACE | COMMENT)+) ~ expression ..., the ANY_TOKEN rule could change the evaluation of the expression by moving a token with meaning out of the rule which gets evaluated. e.g. if!paired then... would be parsed into an ANY_TOKEN (!) and an expression (paired) which will have the opposite meaning of the intended expression, !paired. Which is not to say what you wrote is wrong, just not the entire fix, which speaks to a larger problem at hand.

Maybe I'm not saying anything you didn't already realize, but this example is proving to me that we need to really fight with pest to get it to parse as we want. It seems to me that pest is simply the wrong tool for the job. I'm moving further and further into the logos+rowan camp.

a-frantz avatar May 03 '24 17:05 a-frantz

You're definitely correct! In the grammar we'd need the & prefix on the rule to match it, but not consume a token, I think.

I also wouldn't call it ANY_TOKEN_THAT_MIGHT_START_AN_EXPRESSION, either 😄; just an example of what a fix might entail (i.e. we need to lookahead for any expression starting token).

I've updated the original suggestion to include the & so that there's no confusion.

peterhuene avatar May 03 '24 17:05 peterhuene

I can confirm that this is fixed with the new "experimental" parser implementation.

With our intention to move away from the pest-based parser, I don't think it's worth attempting a fix there.

When we switch parser implementations, I'll resolve this as fixed.

peterhuene avatar Jun 04 '24 21:06 peterhuene