parglare
A terminal that doesn't match specific strings
I'd like to define a terminal that matches words except specific words.
Here is why. Running this code
import parglare
grammar = r"""
Sentence: The? object_name=Identifier "is" A Identifier DOT;
Identifier: IdentifierWord+;
terminals
The: /(?i)The/;
A: /(?i)An?/;
IdentifierWord: /\w+/;
DOT: ".";
"""
text = """The apple is a fruit."""
g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True, consume_input=False)
result = p.parse(text)
print(result)
fails, as expected, with Can't disambiguate between: <IdentifierWord(The)> or <The(The)>, because IdentifierWord matches everything. So I'd like IdentifierWord not to match certain words, such as "the" and "a". However, when I change the definition of the IdentifierWord terminal to IdentifierWord: /(?!The|a)\w+/; so that a negative lookahead excludes those words, the above code fails with
Error at 2:4:"\nThe **> apple is a" => Expected: IdentifierWord but found <A(a)>
I don't understand why this is. It's finding the "a" at the beginning of "apple" and treating it as an "a". I don't know if I'm solving this the best way; is there some other way I should be structuring this sort of grammar, or maybe some better way of defining a terminal that matches all words except certain ones?
The word apple is not matched by (?!The|a)\w+. The a alternative inside the lookahead matches the a at the beginning of apple, so the negative assertion rejects the whole word. What you need to do is make sure the negative assertion takes the word boundary into account. Try this: (?!(The|a)\b)\w+.
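To see the difference outside the parser, here is a minimal sketch using only Python's re module. It assumes that parglare's regex terminals behave like re matching at the current input position. The helper matched and the third pattern, CASE_INSENSITIVE, are illustrative additions that are not from this thread.

```python
import re

# The two IdentifierWord patterns discussed above.
WITHOUT_BOUNDARY = re.compile(r"(?!The|a)\w+")
WITH_BOUNDARY = re.compile(r"(?!(The|a)\b)\w+")

# Hypothetical variant (an assumption, not from the thread): also exclude
# "the", "A", "an" and so on, regardless of case.
CASE_INSENSITIVE = re.compile(r"(?i)(?!(the|an?)\b)\w+")


def matched(pattern, word):
    """Return the text matched at the start of `word`, or None if no match."""
    m = pattern.match(word)
    return m.group(0) if m else None


if __name__ == "__main__":
    for word in ["apple", "a", "The", "Theory", "fruit", "the", "an"]:
        print(
            word,
            matched(WITHOUT_BOUNDARY, word),
            matched(WITH_BOUNDARY, word),
            matched(CASE_INSENSITIVE, word),
        )
```

Without \b, the lookahead rejects any word that merely starts with The or a, such as apple or Theory. With \b, only the whole words The and a are rejected. Note that the suggested pattern is case-sensitive and lists only The and a, so the and an would still match IdentifierWord; something like the third pattern might be needed for a full grammar, but it has not been tried with parglare here.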
Aha! Again, much appreciated; I understand now what I was doing wrong. Thank you!