souffle icon indicating copy to clipboard operation
souffle copied to clipboard

Support for regex engines other than `std::regex`

Open woodruffw opened this issue 3 years ago • 10 comments

Hi there!

I noticed that the C++ synthesized by Souffle uses std::regex, which can stack overflow (at least with libstdc++'s implementation) due to excessive backtracking:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61582

Here's the relevant synthesis fragment:

https://github.com/souffle-lang/souffle/blob/d9209ddda1b831659494f015638956dfb618ac98/src/synthesiser/Synthesiser.cpp#L2409-L2417

I'm not familiar with Souffle's regex engine requirements off the top of my head but, assuming that they don't include lots of extended (i.e., not purely regular) functionality, it might be possible to swap a different regex engine into regex_wrapper based on a build-time option. If so, re2 is a reasonable candidate (it doesn't perform backtracking and is well-tested).

woodruffw avatar Jun 16 '21 18:06 woodruffw

I don't have much experience in this space. Perhaps we have a macro/configuration to switch between different regex implementations.

However, we need to ensure that the synthesiser/interpreter use the same regex implementation when configured accordingly.

b-scholz avatar Jun 18 '21 11:06 b-scholz

We've currently got a variant of souffle using PCRE2. Might be worth doing a breakdown of alternatives or, as scholz suggested, cutting that corner of the system out to make it easy to swap/customise.

SanaaHamel avatar Jul 12 '21 17:07 SanaaHamel

Yeah, here are the big alternatives that I'm aware of:

  • PCRE/PCRE2: Stable, feature-equivalent with C APIs. Both generations of PCRE do backtracking, but are probably much better tuned than the std::regex implementation.
  • re2: Stable, not feature equivalent with a C++ API. re2 does not support backtracking and thus lacks many of the "advanced" (i.e., not purely regular language) features of other regex engines. On the other hand, it's extremely fast and does have all of the standard regular expression features.
  • Boost.Regex: I know this one exists, and that's about all I know about it. Boost (IME) tends to mirror standard C++ APIs, so this might not be the best general purpose implementation given std::regex's problems.

In general, being able to swap between engines would be nice -- my team's particular use case would benefit heavily from RE2's architecture, but we obviously don't want to hamper other's uses of Souffle by limiting the kinds of pattern expressions that are supported. I don't have a lot of cycles for this at the moment, but it's something I could take a stab at in the medium term via a configuration option/featute macro, as @b-scholz suggested.

woodruffw avatar Jul 12 '21 18:07 woodruffw

Is regexes ever precompiled or are they always compiled at runtime in Souffle? I guess this should also play a role in which regex engine would be suitable. If I recall correctly re2 does not do any NFA -> DFA conversions. I would assume that std::regex does if the string is available at compile time. I'm not sure though.

my team's particular use case would benefit heavily from RE2's architecture

@woodruffw Just out of curiosity, would you be able to explain your use case a bit more?

lyxell avatar Jul 12 '21 21:07 lyxell

I would assume that std::regex does if the string is available at compile time. I'm not sure though.

If only we were so lucky :slightly_smiling_face: -- Hana Dusíková has a proposal for compile-time evaluation for std::regex, but it's not slated until C++23 at the earliest. The language of the proposal suggests that every current implementation of std::regex is 100% runtime (if only by convenience, not necessarily by spec).

Just out of curiosity, would you be able to explain your use case a bit more?

Some members of my team are working on whole-program pointer analysis, including on programs with pathological search spaces (think hundreds of globally reachable large, multi-dimensional arrays).

Some of our Souffle rules include very basic regular expressions, where (sub)string matching doesn't quite cut it. Some of these pathological inputs actually cause std::regex to segfault due to a stack overflow, since std::regex's implementation is naively recursive. RE2 doesn't have this problem thanks to its FSM.

woodruffw avatar Jul 12 '21 22:07 woodruffw

If only we were so lucky -- Hana Dusíková has a proposal for compile-time evaluation for std::regex, but it's not slated until C++23 at the earliest. The language of the proposal suggests that every current implementation of std::regex is 100% runtime (if only by convenience, not necessarily by spec).

Yikes. Seems like I really overestimated the state of std::regex.

Some of our Souffle rules include very basic regular expressions, where (sub)string matching doesn't quite cut it. Some of these pathological inputs actually cause std::regex to segfault due to a stack overflow, since std::regex's implementation is naively recursive. RE2 doesn't have this problem thanks to its FSM.

Interesting. Thanks!

lyxell avatar Jul 13 '21 13:07 lyxell

Is regexes ever precompiled or are they always compiled at runtime in Souffle?

Ideally we'd have precompiled for literals + an LRU cache for dynamic expressions. Our fork has a hacky wrapper around SymTbl that memoises PCRE2 compiled expressions. Works well enough for our uses (no dynamic regexs) but it'd be nice to be able to evict unused FSMs/JITs.

SanaaHamel avatar Jul 14 '21 01:07 SanaaHamel

an LRU cache for dynamic expressions

Yeah, this is another source of performance woes with the current implementation -- std::regex objects are initialized on demand, which might mean (tens of) millions of identical allocations for a rule that's hit frequently.

woodruffw avatar Jul 14 '21 01:07 woodruffw

Another alternative is to implement a functor library with external functors for regex.

b-scholz avatar Jul 14 '21 02:07 b-scholz

The current functor interface won't really do. We've extended the regex intrinsics full atoms and support iteration over capture groups (e.g. a(x, i) :- regex_capture(" a bc d ", "\\W(\\w+)\\W", i, offset). -> ("a", 1), ("bc", 2), ("d", 3)). Are there others who'd see serious use for a full FFI-for-atoms-with-optional-indexing?

SanaaHamel avatar Jul 15 '21 23:07 SanaaHamel