How do I capture identifier string sans reserved words
Hi,
I am trying to parse a language with a set of reserved words and identifiers, just like C or C++, and I am running into a strange issue. I have the following expression rules:
identifier = !(reserved) ~r"[a-zA-Z_][a-zA-Z_0-9?]*"
reserved = "if" / "while" / "case" ...
while_expr = "while" "(" identifier ...
Now, when parsing the following text:
while ( x = 1 ) whilevar = 10
the parser is supposed to recognise the string "whilevar" as an identifier. Instead, it expects a "(" after the "while" in "whilevar", based on the "while_expr" rule.
Am I defining the identifier expression rule incorrectly?
Or is there a way to specify precedence to complete the identifier expression rule before it attempts to complete other expression rules?
I have spent quite a bit of effort defining the whole grammar. It is working fine except for this one case. I am really stuck here: any identifier which starts with a reserved word as a substring is not getting parsed correctly.
Kindly respond ASAP.
Thanks, Ravi
I've found myself cheating when faced with this problem - what I do is build a library of reserved keywords (if, while, case, etc) and define some unlikely-to-be-reproduced version of them (Δ__IF__Δ, Δ__WHILE__Δ, Δ__CASE__Δ ) and perform a pre-process global file-replace (ensuring whitespace or some kind of delimiter is in play). I tend to use unicode-Greek characters as a personal preference (and because you're less likely to see these out in the wild) but that choice is up to you.
This way, your reserved keywords are explicitly matched and easily identified without there being (so much) danger of an accidental clash. It's also helpful as I can maintain a separate list of reserved keywords/functions outside of the grammar, and maintain that list more easily should it change.
In your example the text the parser would finally read would look like:
Δ__WHILE__Δ ( x = 1 ) whilevar = 10
Your grammar file for reserved words would be some regex that looks for your reserved pattern - like:
reserved = ~"\Δ__[\w]+__\Δ"
the end result being there'd be less chance of a clash.
This is probably less elegant a solution than what's possible, but it ended up saving me some time.
I'd like to hear/see any more parser/grammar-centric solutions as they'd feel more pure. But maybe a pre-processor is one way to approach this kind of issue?
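For what it's worth, the pre-processing step described above could be sketched roughly like this in Python. The keyword list and the function name mark_reserved are made up for illustration, and the word-boundary check (\b) is my assumption for "ensuring whitespace or some kind of delimiter is in play":

```python
import re

# Hypothetical keyword list; the marker style follows the example in this thread.
RESERVED = ["if", "while", "case"]

# \b on both sides so that "whilevar" and "my_if" are left untouched.
_KEYWORD_RE = re.compile(r"\b(" + "|".join(map(re.escape, RESERVED)) + r")\b")


def mark_reserved(source: str) -> str:
    """Replace each standalone reserved word with a marker such as Δ__WHILE__Δ."""
    return _KEYWORD_RE.sub(lambda m: f"Δ__{m.group(1).upper()}__Δ", source)


marked = mark_reserved("while ( x = 1 ) whilevar = 10")
# marked == "Δ__WHILE__Δ ( x = 1 ) whilevar = 10"
```

One caveat with a blind text replace like this: it will also rewrite keywords that appear inside string literals or comments, so it only fits languages (or inputs) where that is not a concern.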
I would take advantage of Parsimonious' infinite lookahead. This is one of the great advantages of PEGs. Further up in your grammar, where you describe the statements or expressions that while or whilevar can be part of, use alternation to try one, and then, if that fails, fall through to the other:
expression = while_expr / assignment
Yes, this will find the "while", expect a "(", but then not find it and backtrack, next trying assignment. That sort of strategy should get you what you want without any preprocessing.
Thanks Tom and Erik for the detailed replies. I wish I had received these last year. In the language I am trying to parse, the while keyword is always immediately followed by "(". So changing the reserved word from "while" to "while(" worked:
identifier = !non_id_string ~"[a-zA-Z_][a-zA-Z_0-9?]*" _
non_id_string = "while(" / "if(" / "for(" / ...
while = "while(" _ cond _ stmts ")"
Glad you figured out a solution!
For Lua:
identifier = ~r"(?!\b(?:and|break|do|else|elseif|end|false|for|function|goto|if|in|local|nil|not|or|repeat|return|then|true|until|while)\b)[a-zA-Z_][a-zA-Z0-9_]*"
This makes it absolutely impossible for goto to ever be an identifier, I think.
Indeed, keywords are so pervasive in grammars that there should be a doc entry on how to handle them. It would also be nice to have a slightly prettier version of the above: if a new keyword is added, it would be easy to miss the update to this rule.
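One way to keep the keyword list from drifting, sketched here as an assumption rather than anything Parsimonious provides: hold the keywords in a Python list and generate the identifier regex from it. The names LUA_KEYWORDS, identifier_pattern and IDENTIFIER_RULE are made up for this example. Since the pattern contains \b, a raw regex literal (~r"...") is probably the safer way to embed it in a grammar, because a plain quoted literal would likely read \b as a backspace escape.

```python
import re

# Single source of truth for the keywords (the Lua list from the rule above).
LUA_KEYWORDS = [
    "and", "break", "do", "else", "elseif", "end", "false", "for",
    "function", "goto", "if", "in", "local", "nil", "not", "or",
    "repeat", "return", "then", "true", "until", "while",
]


def identifier_pattern(keywords):
    """Build a regex matching an identifier unless it is exactly a keyword."""
    alternatives = "|".join(re.escape(k) for k in sorted(keywords))
    return rf"(?!\b(?:{alternatives})\b)[a-zA-Z_][a-zA-Z0-9_]*"


# Grammar rule text that could be spliced into a larger grammar string.
IDENTIFIER_RULE = f'identifier = ~r"{identifier_pattern(LUA_KEYWORDS)}"'

identifier_re = re.compile(identifier_pattern(LUA_KEYWORDS))
```

With this, adding a keyword means touching only the list; identifier_re.fullmatch("goto") gives None while identifier_re.fullmatch("gotolabel") still matches.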