parsimonious
parsimonious copied to clipboard
Matching being too greedy
I've just found this library, and I'm trying to implement parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.
One example of this nature is below.
Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.
from parsimonious.grammar import Grammar
quote_grammar = Grammar(
r"""
SingleQuotedString = single_quote Char* single_quote WhiteSpace
WhiteSpace = ( SP / HT / EOL )+
single_quote = "'"
Char = ~"[a-zA-Z0-9' \t]"
EOL = ~'[\n\r\f]+'
SP = ' '
HT = '\t'
"""
)
print(quote_grammar.parse("'don't' "))
This fails with
parsimonious.exceptions.ParseError: Rule 'single_quote' didn't match at '' (line 1, column 9).
Implementing the first rule as a straight regex does work
quote_grammar = Grammar(
r"""
SingleQuotedString = ~"['][a-zA-Z0-9' \t]*['][ \t\n\r\f]+"
"""
-->
<RegexNode called "SingleQuotedString" matching "'don't' ">
Where can I start looking on how to fix this behaviour (if indeed it is a bug...)
Char*
in SingleQuotedString = single_quote Char* single_quote WhiteSpace
appears to be the thing that is "too greedy" because your second single_quote
has nothing to match on at the end of your string ("column 9", the last index of the string). According to some resources I read
As a regular expression
\[.*\]
matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic:.*
consumes the rest of the input, so\]
never matches. As a PEG this needs to be written as:\[ ( !\] . )* \]
(or\[ @ \])
. [0]Note that the regular expression does not behave as intended either: in the example
*
should not be greedy, so\[.*?\]
should be used instead.
It looks like you should make use of lookaheads to get the behaviour you need; something like
SingleQuotedString = single_quote (!end_single_quote Char)* end_single_quote
end_single_quote = single_quote Whitespace
[0] https://nim-lang.org/docs/pegs.html
I see. Make the two token search for "' " into a single token search by making it it's own match.
Ta a lot.
I'll try it out.
Note that this is regular, so you could do this with just a re
.
import re
pattern = re.compile(
r"""
( # start capture group
(?<!\w)' # Starting ', which should not be preceded by a word character.
# Uses negative lookbehind.
(?:[^']|'(?=\w))* # A non-' character, or a ' which is followed by a word character.
# Uses positive lookahead.
'(?!\w) # Closing ', not followed by a word character.
# Uses negative lookahead.
) # end capture group
""",
flags=re.VERBOSE
)
assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []
(Note also that this general approach may break on some slang words which end in ', like goin'.)