STL icon indicating copy to clipboard operation
STL copied to clipboard

<regex>: vNext overhaul

Open MikeGitb opened this issue 6 years ago • 9 comments

Just a question: It seems an ABI breaking release is on the (distant) horizon and I was wondering if you'll use the opportunity to overhaul std::regex or if you have practically given up on it due to the vicious cycle of small user base and bad performance.

Also tracked by Microsoft-internal VSO-110128 / AB#110128 and VSO-177627 / AB#177627.

vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.

MikeGitb avatar Dec 22 '19 22:12 MikeGitb

We want to overhaul regex because this will be our only opportunity to fix its longstanding correctness and performance issues for the next N years, but first we need to decide how to do so. We could:

  • Write a new implementation from scratch, with community input.
  • Use an existing Standard-compatible implementation; my vague understanding is that libc++'s implementation is reasonably correct but not especially high performance, while Boost's implementation may have accumulated more performance improvements but also has non-Standard behavior and Boost dependencies that would need to be excised.
  • Adapt a high-performance engine to the Standard's semantics and conventions.

StephanTLavavej avatar Jan 05 '20 21:01 StephanTLavavej

Glad to hear it. I can't really give an informed opinion on which strategy would be the best. I'm just a user.

MikeGitb avatar Jan 06 '20 01:01 MikeGitb

The C++ standard derives the ECMAScript standard for the default regex specification, so it seems like adapting the regex code from a JavaScript engine (like v8 or Chakra) for the additions and additional grammars supported sounds like a viable option.

sylveon avatar Jan 06 '20 01:01 sylveon

Note that licensing is important; we can consume both Boost and Apache License v2.0 with LLVM Exception. Before attempting to consume code from a JavaScript engine, we would need to investigate its license for viability.

StephanTLavavej avatar Jan 06 '20 20:01 StephanTLavavej

v8 is BSD, Chakra is MIT (Chakra being a Microsoft project)

sylveon avatar Jan 06 '20 20:01 sylveon

The following Microsoft-internal bugs are associated with this issue:

  • VSO-110128 "<regex>: regex engine is extremely poor, resulting in stack overflows and other runtime failures with reasonable input"
  • VSO-177627 "<regex>: _Nfa tree should be const after construction is complete"
  • VSO-186671 "User Story: RegEx Refactoring (Bin Compat Breaking Changes)"

BillyONeal avatar Jan 22 '20 23:01 BillyONeal

Did I hear correctly, that the committee wants to deprecate std::regex? In that case it probably doesn't make sense to invest too much time into regex beyond fixing bugs.

MikeGitb avatar Mar 03 '20 11:03 MikeGitb

Even if <regex> ends up deprecated in the standard there's still a lot of code out there using it that don't deserve to be bitten by (1) our multiline nonconformance, (2) our 100X+ perf penalties, (3) other bugs.

The C++ standard derives the ECMAScript standard for the default regex specification, so it seems like adapting the regex code from a JavaScript engine (like v8 or Chakra) for the additions and additional grammars supported sounds like a viable option.

Sadly ECMAScript is but one of the 7 (yes, 7) grammars supported in the standard. :(

BillyONeal avatar Mar 03 '20 19:03 BillyONeal

Removing decision needed as vNext supersedes it.

StephanTLavavej avatar Nov 09 '22 22:11 StephanTLavavej

As discussed on discord, me and @muellerj2 feel that the best way forward here (when, or if, it ever becomes relevant) is to make the std::regex object contain only a pointer to a virtual base class, which contains a matcher function and no other public interface. This function would have exactly one implementation in each Visual Studio version, but may be replaced between versions, allowing us to replace various internals without worrying about ABI breaks (as long as we leave the base class alone).

std::regex_match is templated on the iterator type; therefore, I propose that for non-contiguous iterators, regex_match reads the first 100 characters into a stack-local vector<CharT>, and passes that as a span<CharT> to the virtual matcher function. The function also has a callback function as argument; if the matcher reaches the end of the span, it calls the callback, which reads another 100 characters, appends to the same vector, and returns a span representing the new vector contents (or reports end of string); the matcher then switches to using the new span. The matcher will only return size_t offsets; converting that back to iterators is the caller's job. (For contiguous iterators, the initial span will be the entire thing, and the callback just returns end of string immediately.)

(Various details are negotiable, or intentionally left vague at this point, of course.)

Alcaro avatar Oct 18 '25 14:10 Alcaro

As a further small addition, I think we could avoid the virtual function call for the specializations std::regex and std::wregex by moving the object code of parser and matcher for these two specializations into the STL dll or static library. This would also ensure that there is no mismatch between parser and matcher.

This only applies to specializations we move into the STL dll or static library. The generic implementation of std::basic_regex should still use the virtual base class technique.

muellerj2 avatar Oct 18 '25 15:10 muellerj2