Serialization of compiled regular expressions
Since compiling a regular expression is slow, it would be nice (especially for applications that use lots of them) to be able to serialize a compiled regex to a byte string, which can then be embedded in the application or stored in a resource. Upon startup, the compiled regex can be be loaded. The regex crate would validate the regex prior to use.
I agree that'd be nice, but it remains to be seen whether this would have a significant performance improvement or not. For example, deserializing whatever format the regex is persisted in could in theory be just as slow. What would be nice is if the regex could be used directly from that binary format, but that seems non-trivial.
This is definitely worth exploring.
If someone wanted to work on this, I'd be willing to mentor it. It will take a bit of work. Essentially, we'll need to serialize the entire ExecReadOnly structure, which contains 3 compiled regex programs, the original regex strings, the match type and a literal searcher. The regex programs themselves are a bit beefy but mostly just plain old data. The literal searcher is a bit more complex, since it can contain things like an Aho-Corasick automaton or a Teddy matcher. That means you'd either have to A) implement a serialization technique for both of those matches or B) just re-build them when deserializing the regex. (B) sounds like a fine place to start. In particular, building the literal matchers is probably very fast, so whether (A) is a win or not actually isn't clear.
The first thing to do is to figure out the serialization format. I would be somewhat opposed to bringing in another crate for this, which makes this task artificially harder. One possible alternative would be to put this behind a feature and make it an optional dependency.
That's a very newbie question, but with the online DFA setup does the compilation stage really take significant time compared to using the compiled regex the first few times?
@glapul Nope. :-) When I speak of "compilation" I'm specifically referring to the point at which the regex's abstract syntax tree is translated to a sequence of instructions. This sequence of instructions represents an NFA, but its representation makes it possible to execute in a VM or even via a backtracking search. The representation is also what's used to build the DFA. In other words, the lazy DFA isn't a transformation from AST to DFA, but rather, a transformation from NFA to DFA.
Finally, even if we could somehow go from AST to DFA, we still need the NFA because the DFA can't answer every question supported by the regex API, such as returning sub-capture locations.
FWIW, compile times are most directly affected by the size of the regex. For example, a regex like [0-9]+ is tiny but a regex like \d+ is quite a bit larger. Why? Unicode.
@BurntSushi While it's understandable that saving/restoring compiled regular expressions is a hard issue and requires thorough design, I wonder if it would be possible to add at least string-based serde::{Serialize, Deserialize} implementations to Regex structure? I store regex instances as part of bigger structures and it could be quite helpful to have these implemented on Regex itself so that consumer could just derive(Serialize, Deserialize) any outer structs without hacks or worrying about how exactly such (de)serialization is implemented internally (as it still leaves opportunity to swap implementations to compiled regexes in the future).
@RReverser It's an interesting point. I've done it myself on occasion:
#[derive(Clone, Debug)]
pub struct Regex(regex::Regex);
impl ops::Deref for Regex {
type Target = regex::Regex;
fn deref(&self) -> ®ex::Regex { &self.0 }
}
impl<'de> serde::Deserialize<'de> for Regex {
fn deserialize<D>(de: D) -> Result<Regex, D::Error>
where D: serde::Deserializer<'de>
{
use serde::de::{Error, Visitor};
struct RegexVisitor;
impl<'de> Visitor<'de> for RegexVisitor {
type Value = Regex;
fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
f.write_str("a regular expression pattern")
}
fn visit_str<E: Error>(self, v: &str) -> Result<Regex, E> {
regex::Regex::new(v).map(Regex).map_err(|err| {
E::custom(err.to_string())
})
}
}
de.deserialize_str(RegexVisitor)
}
}
I'm not sure if I want to commit to this in the regex API though. Perhaps it could start in a separate crate?
@BurntSushi Yeah I've done this myself, but newtypes, while widely used as a workaround for such situations in Rust, are still breaking API changes each time you need to introduce them, so if possible to gain support for (de)serializations in base crate, it would be much nicer.
@RReverser I mean, sure, as a general principle, but I don't see how that applies here. Firstly, one generally doesn't expose a Regex in a public API. Second, having a separate stable crate would solve that issue anyway.
Perhaps you elaborate more specifically on your concrete use case?
My use case is parsing/interpreting ASTs where regex is just one of possible values, and, while Regex instance is not exposed directly to the user, it's stored as part of the AST for faster evaluation, and AST itself might be serialized/deserialized to the disk instead of original source.
Sure, newtype is sufficient for this usecase, I'm just saying I usually see it as a last resort since, as soon as you need to implement several traits, nesting newtypes from own and outer crates gets quite dirty.
Although if you add Serialize to it and publish as a separate crate, that would be better than nothing :) Just thought I'd suggest adding such support natively.
Another interesting option could be adding these traits to https://doc.rust-lang.org/regex/regex_syntax/enum.Expr.html and allowing to construct regex out of such Expr, but, as far as I understand, you don't want to make it a public API yet?
@RReverser The regex crate will never ever have a public dependency on regex-syntax. :-) regex-syntax is basically an internal implementation detail of regexes, and it must be allowed to evolve independently of regex.
Stated differently, the interface boundary is the concrete syntax of a regular expression.
Hmm, okay, I just thought it's high-level enough to be exposed. Anyway, given that regex might be serialized in various representations, I tend to agree that separate crate should work in this case and would be enough to prevent most issues.
If someone wanted to work on this, I'd be willing to mentor it.
@BurntSushi Do you think this is still feasible in face of regex internal refactorings you've been planning?
Maybe after that's done. But definitely not right now.
For anyone watching this issue, I just released regex-automata, which is a lower level regex library. One of its features is the serialization and deserialization of regexes, and convenient support for this has been added to ucd-generate via its dfa and regex sub-commands.
Before getting too excited, you'll want to check out the list of differences between regex and regex-automata to see whether it's possible to use it.
I don't know the internals of this crate, but would it be possible (in order to pre-compile it in the executable) to dump the memory area used by the compiled regex in an array of bytes at compile time (I mean, with a proc macro)?
Those bytes would be stored in a constant, and could be unsafely casted to a Regex instance. This would require the Regex struct not to contain pointers.
Is it possible? Yes, pretty much anything is possible.
Is it plausible? No. My estimation is that it would take several months of work for me to do it. It would also very likely increase the amount of unsafe in this crate by a non-trivial number. And it would introduce new compatibility hazards.
regex-automata provides this capacity for fully compiled DFAs today, as mentioned above. That's your best bet right now.
This would require the
Regexstruct not to contain pointers.
Right. There are pointers everywhere. You are very likely drastically underestimating the complexity of this crate's internals. :)
I'm going to close this out because I don't see this happening any time soon. Moreover, regex-automata 0.2 (and especially 0.3 once it's released) will support serialization of regex DFAs. DFAs won't work in every use case, but if you absolutely need serialized regexes, then it might be good enough.
The full story here really is not only a monumental amount of work, but it's not a one-time investment. By exposing a stable binary representation, it also makes internal evolution so much harder because any changes need to be reconciled with the binary format. I really just do not know if it will ever happen.
So I'm going to mark this as "not planned" for now.