regex
regex copied to clipboard
plan: moving regex engines to regex-automata
In discussions in #524, it became clearer to me that it might be good to say more about what my short/long term plans are for the regex
crate.
TL;DR - Push all regex engines down into the regex-automata
crate, and bring that in as an internal-but-published dependency of regex
as an official part of this repository. It will become a second required dependency (with the first being regex-syntax
).
The main motivation for doing this is to clean up the regex internals, specifically with the intent to make it easier to add more optimizations. Currently, the infrastructure is a bit of a hodge podge and it can be quite difficult to know where and when a particular regex engine is being used. On top of that, since every regex engine is strictly internal, testing each engine independently is quite annoying, as it requires exporting undocumented APIs. As a result, bugs have cropped up where certain engines are inconsistent with other engines. Similarly, benchmarking each engine independent of the other is made difficult, which in turn makes it difficult to focus on optimizing certain scenarios that come up in the course of regex execution.
On top of all of this, putting the regex engines into their own crate with a proper API means that folks can use those engines directly. It also provides an outlet of sorts to provide lower level APIs that would I otherwise wouldn't want cluttering up the main regex
crate too much. The main thing on my mind here are streaming APIs (see #425), but there may be other things of use.
I've actually been working towards this goal for a long time now. Arguably, i started around 3 years ago when I began rewriting the regex-syntax
crate. I then moved to rewriting memchr
and aho-corasick
as well. In the process, I pushed complexity down as much as I could. For example, the HIR in regex-syntax
no longer exposes any regex flags (such as case insensitivity), which makes regex compilers simpler. As another example, aho-corasick
now supports leftmost-first matching, which matches the regex crate's match semantics precisely, and now makes it easier to defer to aho-corasick
for an alternation of literals. Yet another example is the Teddy algorithm that uses SIMD for finding matches of multiple literals very quickly. That has been pushed down into aho-corasick
as well, which means more people get to benefit from it and it no longer needs to be maintained inside of regex
proper.
A little over a year ago, I started work on regex-automata
. The main use case for regex-automata
was to generate fully compiled DFAs that could be cheaply embedded into a program and deserialized cheaply for easy matching. (For example, they power the grapheme, word and sentence iterators in the bstr
crate.)
In the course of building regex-automata
, I had to write another HIR -> NFA compiler, with the intent that this would one day become the compiler inside the regex
crate. It has many improvements over the existing one. Firstly, it is much simpler by doing away with the "instruction hole" mechanism used in the current compiler. Secondly, it is also simpler in that it does away with dichotomy between a Unicode NFA and a byte-based NFA, which has been the cause of many bugs and increases the complexity of the matching engines. Thirdly, by getting rid of the Unicode/bytes dichotomy in favor of just using bytes, we will be able to compile one fewer NFAs than what the regex
crate currently does (one Unicode NFA for the Pike VM and the backtracker, 1 forwards byte NFA and 1 backwards byte NFA for the lazy DFA). Fourthly, it is more aggressive about epsilon removal and does away with the binary "alt" instructions in favor of flattening the byte code, which should decrease overhead substantially when matching. Finally, it use's Daciuk's algorithm (the same one used in the fst
crate) to compiler nearly minimal UTF-8 automata in linear time, which substantially reduces the size of the corresponding NFAs and DFAs, while also making them faster to execute by virtue of having fewer epsilon transitions.
regex-automata
also contains new test infrastructure, and most of the tests in the regex
crate have already been ported to it. This test infrastructure should make it much easier to add new tests and also much faster to run while testing more engines. Today's infrastructure is a terrible macro soup, and just building it takes forever because every test is compiled for every regex engine that we test. The new test infrastructure instead loads tests from TOML files and runs them using its own harness. The downside is that this pushes more things outside of Rust's standard unit test harness, but it's well worth it.
As of a couple months ago, I've begun preparing regex-automata
to absorb the matching engines from regex
proper. This will include the lazy DFA, the Pike VM and the backtracker. regex-automata
also provides its own ahead-of-time-compiled DFAs, which I hope to use as yet another engine for cases where the regex is small and full DFA compilation is cheap. This organization should drastically decrease the complexity of adding new regex engines. I hope this will allow me to finally bring in things like one-pass DFA (see #467), as well as some other ideas I've had for years, such as a bit-parallel NFA and even a JIT powered by Cranelift, basically with the goal of fixing one of the weakest parts of the regex
crate right now: the performance of matching with capturing groups. (N.B. If a JIT were added, it would be an optional opt-in feature.)
Once regex-automata
has absorbed the regex engines inside of regex
, my plan at that point is to move regex-automata
into this repository and start integrating it into regex
proper. At that point, the main responsibility of regex
will be to glue everything together, plan the appropriate optimizations, including literal oriented optimizations and provide the higher level API that it does today. At this point in time, I do not plan on making any breaking changes to regex
during all of this.
As a stretch goal, one of the things I'd like to do is to bring the cheap deserialization of regex-automata's DFAs to the regex
crate proper. This would permit users to generate pre-compiled Regex
objects that can be embedded into any program. This would allow one to completely avoid paying for compilation time. This does require a lot of work though, so I don't know if this will happen initially. (Doing this basically implies that every single representation of a finite state machine, including all of the literal prefilters, need to be able to work directly from a more primitive representation such as a &[u8]
.)
Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex
crate.
You probably already plan to do something like this, but just in case you have not already planned on it, I think it would be useful for the regex-automata
crate to have a set of features that people can use to pull in just a single engine. I'm thinking in particular about situations where people might care more about code size than performance and are happy with just pulling in the PikeVM. It might even be useful to have a separate regex-engine-lite
crate that just has the PikeVM so that people who don't know the difference between the different matchers can pull it in.
@ethanpailes Yeah I may consider that. Each engine on its own doesn't usually take up too much code and usually doesn't require any extra dependencies. But sure, chopping it up with features like I did for regex
is definitely on the table. I'll probably avoid adding new *-lite
crates though.
Still amazed to witness someone as clever and dedicated as you. The programming community would be a sadder place without you. Thank you for you wonderful hard work.
TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax). [...] Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.
Given that you are pushing regex engines out of the regex
crate, shouldn't it be called the Great Regex Engine Pushout Refactor, or GREP-R for short? :wink:
Btw, love your work and the dedication you put into maintaining one of the most important crates in the rust ecosystem! :muscle:
Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository
At this point in time, I do not plan on making any breaking changes to regex during all of this.
While this sounds like a great effort for regex
and the unification of the two different crates, I'd be curious what this means for users of regex-automata
. I'm not particularly concerned about breaking changes itself, but is there any plan to remove functionality in favor of unification?
It seems like the basics of regex-automata
are fundamentally required for both regex
and regex-automata
, with a DFA and transition between states, but I'd like to make sure that usecases like streaming regex parsers are still captured in that new ecosystem. Especially since even regex-automata
has that as a secondary usecase since it's rarely desired of course.
Fundamentally it sounds to me like regex-automata
users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.
Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.
That's the intent yes. I have no plans to remove functionality from regex-automata
. :-)
I have a PR up for phase 1 of 2 of this transition: https://github.com/rust-lang/regex/pull/977
Basically, this PR is preparing for the regex
crate to cut over to regex-automata
for its internals. That will include bringing regex-automata
into this repository. I don't have a specific timeline for when that will happen, but I do want to let this first phase bake for a little bit. Hopefully within a month. regex-automata
itself is ready.
For anyone following this thread, regex 1.9
(the next release) is going to close this issue. Finally. I plan to release it after I've finished documentation and polishing work. (There is a fair bit to do here.)
I am also currently planning to release a new crate, regex-lite
, simultaneously with regex 1.9
. You can see more about that and why specifically I chose to do it with regex 1.9
here: https://github.com/rust-lang/regex/issues/961#issuecomment-1527512468