`<regex>`: Improve matcher's skip optimization

Open muellerj2 opened this issue 8 months ago • 0 comments

Split from #5452. Other than the performance fix in #5457, I listed a few optimizations that could be investigated for _Skip in this comment:

The evaluation of _N_if should really do a breadth-first rather than a depth-first search. I think the most effective way to achieve that in the long term is to construct something equivalent to an _N_class node (or perhaps two) that matches all possible characters that could appear at the start of a match of a regular expression, and then make _Skip only evaluate this special node. This special node could either be computed at the end of parsing (to be stored in an extended version of the root node) or at the beginning of matching.

The _N_str implementation should be translated into an actual call to find or find_if (depending on syntax option flags) or perhaps even be lowered to a call to search.

_Skip should not just always give up when encountering an _N_endif (or _N_rep) node, but this requires some analysis when it is appropriate to keep going.

I also mentioned dynamic programming/memoization as a potential alternative to breadth-first search in a follow-up comment:

We memoize the first possible match for each branch and only evaluate again when we move beyond this position. That means we need to retain some kind of map between nodes and positions. Recognizing if we moved beyond a memoized position is complicated by the fact that the iterators on the searched string might not be random access iterators.

But I no longer think this alternative is viable because we can't retain this cache between regex_search calls, so memoization is not sufficient to avoid quadratic running time in all cases when the skip optimization tests all possible branches.

In addition, this approach also penalizes cases where regex_search is only called once or so and the regex (but not the first branch in the regex) matches close to the start of the searched string. In this case, this approach spends up to a linear amount of time to test the first branch, even though most of the string doesn't have to be read at all.

May 03 '25 14:05 muellerj2