parsimonious icon indicating copy to clipboard operation
parsimonious copied to clipboard

Matching being too greedy

Open rowlesmr opened this issue 3 years ago • 3 comments

I've just found this library, and I'm trying to implement parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.

One example of this nature is below.

Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.

from parsimonious.grammar import Grammar
quote_grammar = Grammar(
    r"""
    SingleQuotedString = single_quote Char* single_quote WhiteSpace
    WhiteSpace = ( SP / HT / EOL )+
    single_quote = "'"  
    
    Char = ~"[a-zA-Z0-9' \t]"
    
    EOL = ~'[\n\r\f]+'
    SP = ' '
    HT = '\t'
    """
)

print(quote_grammar.parse("'don't' "))

This fails with

parsimonious.exceptions.ParseError: Rule 'single_quote' didn't match at '' (line 1, column 9).

Implementing the first rule as a straight regex does work

quote_grammar = Grammar(
    r"""
    SingleQuotedString = ~"['][a-zA-Z0-9' \t]*['][ \t\n\r\f]+"
    """

 -->
<RegexNode called "SingleQuotedString" matching "'don't' ">

Where can I start looking on how to fix this behaviour (if indeed it is a bug...)

rowlesmr avatar Feb 26 '22 14:02 rowlesmr

Char* in SingleQuotedString = single_quote Char* single_quote WhiteSpace appears to be the thing that is "too greedy" because your second single_quote has nothing to match on at the end of your string ("column 9", the last index of the string). According to some resources I read

As a regular expression \[.*\] matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic: .* consumes the rest of the input, so \] never matches. As a PEG this needs to be written as: \[ ( !\] . )* \] (or \[ @ \]). [0]

Note that the regular expression does not behave as intended either: in the example * should not be greedy, so \[.*?\] should be used instead.

It looks like you should make use of lookaheads to get the behaviour you need; something like

SingleQuotedString = single_quote (!end_single_quote Char)* end_single_quote
end_single_quote = single_quote Whitespace

[0] https://nim-lang.org/docs/pegs.html

comalice avatar Mar 01 '22 15:03 comalice

I see. Make the two token search for "' " into a single token search by making it it's own match.

Ta a lot.

I'll try it out.

rowlesmr avatar Mar 01 '22 23:03 rowlesmr

Note that this is regular, so you could do this with just a re.

import re
pattern = re.compile(
    r"""
        (  # start capture group
            (?<!\w)'  # Starting ', which should not be preceded by a word character.
                      # Uses negative lookbehind.
            (?:[^']|'(?=\w))*  # A non-' character, or a ' which is followed by a word character.
                               # Uses positive lookahead.
            '(?!\w)  # Closing ', not followed by a word character.
                     # Uses negative lookahead.
        )  # end capture group
    """,
    flags=re.VERBOSE
)
assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []

(Note also that this general approach may break on some slang words which end in ', like goin'.)

lucaswiman avatar Mar 01 '22 23:03 lucaswiman