*? matching sometimes matches too much
Given a file:
foo
bar
baz
quux
bar
baz
quux
matching with ug 'f.*(\n.*)*?\nq' erroneously grabs all lines.
Same happens with f.*(\n.*)*?\nqu and f.*(\n.*)*?\nquu — but f.*(\n.*)*?\nquux works correctly.
I disagree that "matching with ug f.*(\n.*)*?\nq erroneously grabs all lines", because the POSIX form of (lazy) pattern matching differs from the Perl (lazy) pattern matching. Just use f.*(\n.*?)*?\nq as a pattern with a nested lazy quantifier or use option -P for Perl pattern matching. Perhaps it is possible to force the POSIX lazy matcher to more closely match to Perl behavior, though the POSIX "leftmost longest match" rule will always be a deciding factor.
There are some subtle differences, which I will explain.
Update: a new version with updated support for POSIX regex lazy quantifiers will be released that makes POSIX lazy quantifiers practically behave the same as Perl lazy quantifiers. See further below
Ugrep (through RE/flex) adds lazy-quantifier capability which isn't well defined in POSIX for some non-trivial patterns such as lazy/greedy nesting. You have a greedy repeat nested inside a lazy repeat. You may want to use 'f.*(\n.*?)*?\nq' as a pattern with a nested lazy quantifier or use option -P.
To get the Perl regex behavior, always use option -P for Perl regex matching. Otherwise you will get the POSIX (grep) regex matching behavior, but ugrep also adds lazy quantifiers that GNU/BSD grep do not support (i.e. without -P Perl matching).
Looking into the details of DFA/POSIX matching versus Perl matching for this particular case reveals some interesting differences. Note that POSIX "leftmost longest match" rule isn't the simple answer to this observation, but it is related. This rule makes POSIX inherently a bit more greedier, e.g. foo(b|.*) matches foobar fully with POSIX but only the first foob part of foobar with Perl matching. Likewise, there are subtle differences in the way lazy quantifiers work with POSIX. The inner \n.* gobbles up a full line "greedily", which has a stronger preference for the POSIX matcher to satisfy compared to the short \nq that lazily comes after to stop the matcher. But for the case that \nquux is specified as in f.*(\n.*)*?\nquux, then the greedy \n.* does not have priority any longer over the \nquux that lazily comes after to stop the matcher. In that case laziness wins over greediness. But on the other hand this is arguably counter intuitive. I will investigate this further.
A quick update.
I plan to work on this issue to improve lazy quantifiers in POSIX mode (the default ugrep regex mode, like GNU grep).
Nobody has seriously attempted to implement lazy quantifiers in POSIX regular expressions, because POSIX sub-matching is ambiguous and ambiguity is "not allowed" to put it a bit simplistically. As I've pointed out, lazy quantifiers are possible for optionals and repetitions, but not for alternations like |? that cannot be lazy in POSIX.
It is provably true that POSIX lazy quantifiers cannot in all cases behave the same exact way that Perl regex lazy quantifiers work. However, there is an opportunity to get closer in similarity by reworking the RE/flex algorithm that annotates the DFA of the regular expression to match regex efficiently.
The tricky parts is when greedy and lazy quantifiers clash, either when used in nested forms (as in your example), or when used in sequence in a regex (which we can handle OK).
I wrote the first algorithm in 2016 for RE/flex to support POSIX lazy quantifier transformation to DFAs. There was no other algorithm or some (hacked) implementation publicly available that could do this. POSIX lazy quantifiers just don't exist, unlike Perl regex lazy quantifiers.
The improvements I am making are more closely aligned with Perl regex lazy forms. Perl pattern matching is depth-first using NFAs, essentially, which makes lazy quantifier matching trivial (just ignore lazy parts, then backtrack to match lazy parts). POSIX pattern matching is breadth-first using DFAs, essentially. DFAs have a speed advantage (linear time to match a pattern, i.e. never backtracks to match). So lazy quantifiers cannot simply be implemented by ignoring the lazy part and then backtrack when needed.
I am currently addressing a deficiency in my current algorithm implemented in RE/flex and ugrep that makes some regex patterns with lazy quantifiers match too much or too little. It is an elegant algorithm, but it is finicky with respect to lazy-greedy conflicts in a regex.
To illustrate the progress I've made, a simplified form of your regex is (c[ab]*)*?cb using abc only, where c stands for \n and [ab] stands for .. The improved algorithm generates a DFA that captures this lazy regex beautifully:
The state annotations
? mark lazy positions in the regex string. Note that the DFA pattern ends in a final state for cb as should be.
Good news. After refining the DFA construction algorithm, I am now able to verify that all regex in my test set behave the same as Perl regex lazy patterns. The DFA construction is still cheap to annotate states during the construction. No additional passes are necessary over the DFA graph.
Here is a sample of my test cases, if anyone is interested. I have many more, but these are short and typically challenging:
- Lazy optional X?
(a|b)??a
a(a|b)??(?=a|ab)|ac
(a|b)??(a|b)??aa
(a|b)??(a|b)??(a|b)??aaa
a??b?a
a??b?b
- Lazy closure X*
a*?a
a*?|b
(a|bb)*?abb
ab*?|b
(ab)*?|b
a(ab)*?|b
(a|b)*?a|c?
a(a|b)*?a
a(a|b)*?a??|b
a(a|b)*?a?
a(a|b)*?a|a
a(a|b)*?a|a?
a(a|b)*?a|a??
a(a|b)*?a|aa?
a(a|b)*?a|aa??
ab(ab|cd)*?ab|ab
(a|b)(a|b)*?a|a
(ab|cd)(ab|cd)*?ab|ab
(ab)(ab)*?a|b
a?(a|b)*?a
(?m)^(a|b)*?a
(?m)(a|b)*?a$
(a|b)*?a\\b
(?m)^(a|b)*?|b
- Lazy positive closure X+
a+?a
(a|b)+?
(a|b)+?a
(a|b)+?a|c?
(ab|cd)+?ab|d?
(ab)+?a|b
(ab)+?ac
ABB*?|ab+?|A|a
(a|b)+?a|a
(?m)^(a|b)+?a
(?m)(a|b)+?a$
- Lazy iterations {n,m}
(a|b){0,3}?aaa
(a|b){1,3}?aaa
(a|b){1,3}?aaa
(ab|cd){0,3}?ababab
(ab|cd){1,3}?ababab
(a|b){1,}?a|a
(a|b){2,}?a|aa
- Lazy nestings
(c[ab]*)*?cb|bb
(c[ab]*)[abc]*?cb|bb
(a+)??aaa
((a|b)??b)*
((a|b)*?b)*
((a|b)+?b)*
((a|b)??b)?
((a|b)*?b)?
((a|b)+?b)?
((a|b)??b)+
((a|b)*?b)+
((a|b)+?b)+
((a|b)+?)?
((a|b)+?)+