regex
regex copied to clipboard
aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b`
Discussed in https://github.com/rust-lang/regex/discussions/890
Originally posted by Guillermogsjc July 19, 2022
Hi, to match efficiently large amounts of alternations, I guess it is interesting to trigger aho_corasick variant here https://github.com/rust-lang/regex/blob/9ca3099037dcb2faf1b49e6493f4c758532f2da1/src/exec.rs#L91 regarding doc:
/// This is only set when the entire regex is a simple unanchored /// alternation of literals. We could probably use it more circumstances, /// but this is already hacky enough in this architecture.
The question is: is there any way to use word boundaries in such a way this expression is highly optimized for a thing like this?
r"\b(a|... #massive ammount of literal alternations here# ...|z)\b"
or with (?-u:\b) instead of \b .
And... regarding PERFORMANCE documentation here
there is no problem with using non-greedy matching or having lots of alternations in your regex
this previously stated regex would be in the set of "no problem" ?
Thanks
I think the code comment is somewhat clear that the Aho-Corasick optimization only applies in some very limited cases today. Ideally, yes, we should absolutely enable it for the case of \b(literal1|literal2|...|literalN)\b. However, I do not want to add any more subtle optimizations (and this would be one example) to the regex crate until #656 is done. The current infrastructure is just too brittle.
To work around this today, I think you have a couple of options available to you:
- Use the
aho-corasickcrate directly. When you find a match, then that it's surrounded by word boundaries manually. - Use
regex-automata. You'll have to use(?-u:\b)and the regex will take longer to build than whataho-corasickwould do, but it should be about as fast. - Don't worry about the Aho-Corasick optimization at all. Use
(?-u:\b). In that case, this crate's lazy DFA will be used, which is likely to be as fast as Aho-Corasick in practice. However, this really depends on the number of literals you have here.
This example demonstrates why this optimization is subtle, and more generally, the difficulty of composition with respect to regex internals:
$ regex-cli count matches nfa thompson pikevm --matches 'samwise' '(?:sam|samwise)'
build pike vm time: 29.693µs
create cache time: 658ns
cache size: 800
search time: 4.221µs
counts: [1]
PatternID(0): [0, 3)
$ regex-cli count matches nfa thompson pikevm --matches 'samwise' '\b(?:sam|samwise)\b'
build pike vm time: 26.135µs
create cache time: 585ns
cache size: 896
search time: 11.146884ms
counts: [1]
PatternID(0): [0, 7)
The key thing here is that you can't just try to match the inner (?:sam|samwise) without regard for the surrounding word boundaries, because the surrounding \b's might change which thing gets reported.
Here are some ideas (mostly written with a future me as a target audience, so apologies if they appear cryptic):
Normally, Aho-Corasick is used with leftmost-first or "preference order" match semantics. This way, it is purely consistent with the match semantics of the regex crate. But in this case, maybe we need to use standard match semantics and execute an overlapping search. That way, we can be sure that we're guaranteed to see every possible match. So for example, this code:
use aho_corasick::{MatchKind, AhoCorasickBuilder};
fn main() {
let ac = AhoCorasickBuilder::new()
.match_kind(MatchKind::Standard)
.build(&["sam", "samwise"]);
for m in ac.find_overlapping_iter("samwise") {
println!("{:?}", (m.pattern(), m.start()..m.end()));
}
}
outputs:
(0, 0..3)
(1, 0..7)
So if we iterated over those matches and then tried to test whether the surrounding word boundary assertions held and reported the first one that did, we'd get the right answer here.
Unfortunately, I don't think this works in all cases. Consider this one:
use aho_corasick::{MatchKind, AhoCorasickBuilder};
fn main() {
let ac = AhoCorasickBuilder::new()
.match_kind(MatchKind::Standard)
.build(&["sam the wise", "the"]);
for m in ac.find_overlapping_iter("sam the wise") {
println!("{:?}", (m.pattern(), m.start()..m.end()));
}
}
outputs:
(1, 4..7)
(0, 0..12)
So if we applied our idea from above, we'd report the as the match since it's the first one reported by Aho-Corasick where the word boundary assertions are satisfied. But this would be the wrong answer:
$ regex-cli count matches nfa thompson pikevm --matches 'sam the wise' '\b(?:sam the wise|the)\b'
build pike vm time: 32.37µs
create cache time: 601ns
cache size: 1136
search time: 11.419901ms
counts: [1]
PatternID(0): [0, 12)
The right answer is sam the wise because of leftmost-first or "preference order" matching.
I think the reason why this case fails is because \b matches at a location other then the start or end of sam the wise. Thus, there is an opportunity for a substring in sam the wise to match even if preference should be given to sam the wise.
More generally, my intuition is that for any regex of the form \b(?:re)\b, we can match re independently and then test the word boundary assertions as long as the following holds:
- We report every possible match of
re, including overlapping matches. - For all strings
sin the language defined byre, it must be true that\bmatches ats[0],s[s.len()-1], both or neither, but nowhere else.
Probably this can be generalized even further to re-left(?:re-inner)re-right.