regex plan: moving regex engines to regex-automata

In discussions in #524, it became clearer to me that it might be good to say more about what my short/long term plans are for the regex crate.

TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax).

The main motivation for doing this is to clean up the regex internals, specifically with the intent to make it easier to add more optimizations. Currently, the infrastructure is a bit of a hodge podge and it can be quite difficult to know where and when a particular regex engine is being used. On top of that, since every regex engine is strictly internal, testing each engine independently is quite annoying, as it requires exporting undocumented APIs. As a result, bugs have cropped up where certain engines are inconsistent with other engines. Similarly, benchmarking each engine independent of the other is made difficult, which in turn makes it difficult to focus on optimizing certain scenarios that come up in the course of regex execution.

On top of all of this, putting the regex engines into their own crate with a proper API means that folks can use those engines directly. It also provides an outlet of sorts to provide lower level APIs that would I otherwise wouldn't want cluttering up the main regex crate too much. The main thing on my mind here are streaming APIs (see #425), but there may be other things of use.

I've actually been working towards this goal for a long time now. Arguably, i started around 3 years ago when I began rewriting the regex-syntax crate. I then moved to rewriting memchr and aho-corasick as well. In the process, I pushed complexity down as much as I could. For example, the HIR in regex-syntax no longer exposes any regex flags (such as case insensitivity), which makes regex compilers simpler. As another example, aho-corasick now supports leftmost-first matching, which matches the regex crate's match semantics precisely, and now makes it easier to defer to aho-corasick for an alternation of literals. Yet another example is the Teddy algorithm that uses SIMD for finding matches of multiple literals very quickly. That has been pushed down into aho-corasick as well, which means more people get to benefit from it and it no longer needs to be maintained inside of regex proper.

A little over a year ago, I started work on regex-automata. The main use case for regex-automata was to generate fully compiled DFAs that could be cheaply embedded into a program and deserialized cheaply for easy matching. (For example, they power the grapheme, word and sentence iterators in the bstr crate.)

In the course of building regex-automata, I had to write another HIR -> NFA compiler, with the intent that this would one day become the compiler inside the regex crate. It has many improvements over the existing one. Firstly, it is much simpler by doing away with the "instruction hole" mechanism used in the current compiler. Secondly, it is also simpler in that it does away with dichotomy between a Unicode NFA and a byte-based NFA, which has been the cause of many bugs and increases the complexity of the matching engines. Thirdly, by getting rid of the Unicode/bytes dichotomy in favor of just using bytes, we will be able to compile one fewer NFAs than what the regex crate currently does (one Unicode NFA for the Pike VM and the backtracker, 1 forwards byte NFA and 1 backwards byte NFA for the lazy DFA). Fourthly, it is more aggressive about epsilon removal and does away with the binary "alt" instructions in favor of flattening the byte code, which should decrease overhead substantially when matching. Finally, it use's Daciuk's algorithm (the same one used in the fst crate) to compiler nearly minimal UTF-8 automata in linear time, which substantially reduces the size of the corresponding NFAs and DFAs, while also making them faster to execute by virtue of having fewer epsilon transitions.

regex-automata also contains new test infrastructure, and most of the tests in the regex crate have already been ported to it. This test infrastructure should make it much easier to add new tests and also much faster to run while testing more engines. Today's infrastructure is a terrible macro soup, and just building it takes forever because every test is compiled for every regex engine that we test. The new test infrastructure instead loads tests from TOML files and runs them using its own harness. The downside is that this pushes more things outside of Rust's standard unit test harness, but it's well worth it.

As of a couple months ago, I've begun preparing regex-automata to absorb the matching engines from regex proper. This will include the lazy DFA, the Pike VM and the backtracker. regex-automata also provides its own ahead-of-time-compiled DFAs, which I hope to use as yet another engine for cases where the regex is small and full DFA compilation is cheap. This organization should drastically decrease the complexity of adding new regex engines. I hope this will allow me to finally bring in things like one-pass DFA (see #467), as well as some other ideas I've had for years, such as a bit-parallel NFA and even a JIT powered by Cranelift, basically with the goal of fixing one of the weakest parts of the regex crate right now: the performance of matching with capturing groups. (N.B. If a JIT were added, it would be an optional opt-in feature.)

Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository and start integrating it into regex proper. At that point, the main responsibility of regex will be to glue everything together, plan the appropriate optimizations, including literal oriented optimizations and provide the higher level API that it does today. At this point in time, I do not plan on making any breaking changes to regex during all of this.

As a stretch goal, one of the things I'd like to do is to bring the cheap deserialization of regex-automata's DFAs to the regex crate proper. This would permit users to generate pre-compiled Regex objects that can be embedded into any program. This would allow one to completely avoid paying for compilation time. This does require a lot of work though, so I don't know if this will happen initially. (Doing this basically implies that every single representation of a finite state machine, including all of the literal prefilters, need to be able to work directly from a more primitive representation such as a &[u8].)

Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.

Mar 24 '20 14:03 BurntSushi

You probably already plan to do something like this, but just in case you have not already planned on it, I think it would be useful for the regex-automata crate to have a set of features that people can use to pull in just a single engine. I'm thinking in particular about situations where people might care more about code size than performance and are happy with just pulling in the PikeVM. It might even be useful to have a separate regex-engine-lite crate that just has the PikeVM so that people who don't know the difference between the different matchers can pull it in.

Mar 24 '20 14:03 ethanpailes

@ethanpailes Yeah I may consider that. Each engine on its own doesn't usually take up too much code and usually doesn't require any extra dependencies. But sure, chopping it up with features like I did for regex is definitely on the table. I'll probably avoid adding new *-lite crates though.

Mar 24 '20 14:03 BurntSushi

Still amazed to witness someone as clever and dedicated as you. The programming community would be a sadder place without you. Thank you for you wonderful hard work.

Mar 24 '20 18:03 Voultapher

TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax). [...] Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.

Given that you are pushing regex engines out of the regex crate, shouldn't it be called the Great Regex Engine Pushout Refactor, or GREP-R for short? :wink:

Btw, love your work and the dedication you put into maintaining one of the most important crates in the rust ecosystem! :muscle:

Mar 24 '20 19:03 giovanniberti

Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository

At this point in time, I do not plan on making any breaking changes to regex during all of this.

While this sounds like a great effort for regex and the unification of the two different crates, I'd be curious what this means for users of regex-automata. I'm not particularly concerned about breaking changes itself, but is there any plan to remove functionality in favor of unification?

It seems like the basics of regex-automata are fundamentally required for both regex and regex-automata, with a DFA and transition between states, but I'd like to make sure that usecases like streaming regex parsers are still captured in that new ecosystem. Especially since even regex-automata has that as a secondary usecase since it's rarely desired of course.

Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.

Mar 24 '20 23:03 chrisduerr

Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.

That's the intent yes. I have no plans to remove functionality from regex-automata. :-)

Mar 24 '20 23:03 BurntSushi

I have a PR up for phase 1 of 2 of this transition: https://github.com/rust-lang/regex/pull/977

Basically, this PR is preparing for the regex crate to cut over to regex-automata for its internals. That will include bringing regex-automata into this repository. I don't have a specific timeline for when that will happen, but I do want to let this first phase bake for a little bit. Hopefully within a month. regex-automata itself is ready.

Apr 17 '23 18:04 BurntSushi

For anyone following this thread, regex 1.9 (the next release) is going to close this issue. Finally. I plan to release it after I've finished documentation and polishing work. (There is a fair bit to do here.)

I am also currently planning to release a new crate, regex-lite, simultaneously with regex 1.9. You can see more about that and why specifically I chose to do it with regex 1.9 here: https://github.com/rust-lang/regex/issues/961#issuecomment-1527512468

Apr 28 '23 12:04 BurntSushi

regex regex copied to clipboard

plan: moving regex engines to regex-automata

regex
regex copied to clipboard