STL icon indicating copy to clipboard operation
STL copied to clipboard

`<regex>`: Optional empty repetitions are illegal

Open Alcaro opened this issue 7 months ago • 3 comments

Describe the bug

NOTE 4 Step 1 of the RepeatMatcher's d closure states that, once the minimum number of repetitions has been satisfied, any more expansions of Atom that match the empty String are not considered for further repetitions.

~JS spec, https://262.ecma-international.org/5.1/#sec-15.10.2.5

MS-STL currently does not comply with that clause (and neither do libstdc++ or libc++).

Command-line test case

#include <regex>
#include <stdio.h>

int main() {
    try {
        std::string s{"b"};
        std::regex r{"(a*)*"};
        std::smatch m;
        bool result = std::regex_search(s, m, r);
        printf("regex_search: %d\n", result);
        for (unsigned i = 0; i < m.size(); ++i) {
            printf("m[%d]", i);
            if (m[i].matched) {
                printf(".str(): \"%s\"\n", m[i].str().c_str());
            } else {
                puts(".matched: false");
            }
        }
    } catch (const std::exception& e) {
        printf("Exception: %s\n", e.what());
    }
}

This also affects whether a(b?)+c\1d matches abcd (it shouldn't), though various recently-fixed regex bugs make it hard to demonstrate on Godbolt.

Expected behavior

Group 0 - empty string Group 1 - no match

STL version

@muellerj2 says it's still present on main as of today

Additional context

https://godbolt.org/z/s7ejf7GKv

https://github.com/llvm/llvm-project/issues/133314 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120212

Alcaro avatar May 10 '25 19:05 Alcaro

This also affects (a?)?, or potentially any expression containing capture groups quantified with ?. In such cases, the natural fix also involves disabling the rewriting of ? in _Parser::_Quantifier().

Unfortunately, this has a bad side effect: It can cause the simple loop optimization to be disabled, which means some additional regexes potentially become susceptible to stack overflow (as in #997, #1528). Specifically, this makes test_VSO_208146_regex_should_special_case_optional_as_non_quantifier() in VSO_0000000_regex_use run into a stack overflow.

I can come up with one alternative: Adding a new flag to the _Node_if node, which means that a match should be rejected if this branch matched an empty string. But this complicates the matcher a lot more because it has to handle one more special case and maintain more state to determine whether the match for this branch was empty.

Obviously, the true long-term fix is to make the matcher non-recursive, but that solution isn't available to us now. (And even then, there would still be this side effect if an old matcher were to be mixed with a new parser.)

muellerj2 avatar May 11 '25 11:05 muellerj2

In my regex engine, I implemented that rule by piggybacking on capture groups. If the current position is equal to the relevant saved pointer, backtrack. (Then discard the extra capture groups when matching is done.)

Your choice if that's a good solution, or even possible in your architecture and under your ABI constraints, of course.

Alcaro avatar May 11 '25 11:05 Alcaro

I found a solution: I can actually enable the simple loop optimization even if the loop logic is exerted in the specific case when the subexpression is branchless and all outer loops are allowed to match at most once (so when all outer loops were defined with ?, {0}, {0,0} {0,1}, {1} or {1,1} quantifiers).

I can't enable it for branchless loops when the outer loops match more than once because this makes the simple loop optimization go haywire: The boundaries of capture groups aren't reset at all when control is returned to _Do_rep0(), so the starting positions of capture groups in surrounding loops get corrupted when the outer loop is repeated one more time. I could fix this in the current matcher, but that still wouldn't allow us to enable the simple loop optimization because this wouldn't fix past matchers.

In my regex engine, I implemented that rule by piggybacking on capture groups. If the current position is equal to the relevant saved pointer, backtrack. (Then discard the extra capture groups when matching is done.)

This is kind of the alternative I mentioned above, but it's not straightforward because the parser is not allowed to emit these capture groups explicitly as capture groups (ABI). These artificial capture groups would either require an extension to _N_if NFA nodes to store a unique identifier or would have to be a matcher-only concept, requiring some map from NFA node pointers to the starting positions of the capture groups.

For loops specifically, there already exist such artifical capture groups (see _Loop_vals_v2_t), but the original point of this optimization was to avoid the loop logic in the first place.

muellerj2 avatar May 11 '25 13:05 muellerj2

I could fix this in the current matcher, but that still wouldn't allow us to enable the simple loop optimization because this wouldn't fix past matchers.

In #5456 you renamed the matcher to _Matcher2, so (until that ships) I don't think you need to worry about current/past mixing. The class name affects the mangled names of all member functions, preventing mixing.

StephanTLavavej avatar May 15 '25 09:05 StephanTLavavej

In https://github.com/microsoft/STL/pull/5456 you renamed the matcher to _Matcher2, so (until that ships) I don't think you need to worry about current/past mixing

No, this doesn't help because this specific problem is about mixing a new parser with an old matcher. The new parser would enable the single loop optimization by setting a flag on the NFA and the old matcher would obey the setting and produce crazy results.

I could set a new flag on the NFA nodes that is unknown to past matchers, but I would like to avoid the introduction of new flags for performance-related optimizations until the matcher has been changed to be non-recursive and we have a clearer perspective what optimizations will help in the long term.

muellerj2 avatar May 15 '25 18:05 muellerj2