regex aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b`

trafficstars

Discussed in https://github.com/rust-lang/regex/discussions/890

^{Originally posted by Guillermogsjc July 19, 2022} Hi, to match efficiently large amounts of alternations, I guess it is interesting to trigger aho_corasick variant here https://github.com/rust-lang/regex/blob/9ca3099037dcb2faf1b49e6493f4c758532f2da1/src/exec.rs#L91 regarding doc:

/// This is only set when the entire regex is a simple unanchored /// alternation of literals. We could probably use it more circumstances, /// but this is already hacky enough in this architecture.

The question is: is there any way to use word boundaries in such a way this expression is highly optimized for a thing like this?

r"\b(a|... #massive ammount of literal alternations here# ...|z)\b"

or with (?-u:\b) instead of \b .

And... regarding PERFORMANCE documentation here

there is no problem with using non-greedy matching or having lots of alternations in your regex

this previously stated regex would be in the set of "no problem" ?

Thanks

Jul 19 '22 12:07 BurntSushi

I think the code comment is somewhat clear that the Aho-Corasick optimization only applies in some very limited cases today. Ideally, yes, we should absolutely enable it for the case of \b(literal1|literal2|...|literalN)\b. However, I do not want to add any more subtle optimizations (and this would be one example) to the regex crate until #656 is done. The current infrastructure is just too brittle.

To work around this today, I think you have a couple of options available to you:

Use the aho-corasick crate directly. When you find a match, then that it's surrounded by word boundaries manually.
Use regex-automata. You'll have to use (?-u:\b) and the regex will take longer to build than what aho-corasick would do, but it should be about as fast.
Don't worry about the Aho-Corasick optimization at all. Use (?-u:\b). In that case, this crate's lazy DFA will be used, which is likely to be as fast as Aho-Corasick in practice. However, this really depends on the number of literals you have here.

Jul 19 '22 12:07 BurntSushi

This example demonstrates why this optimization is subtle, and more generally, the difficulty of composition with respect to regex internals:

$ regex-cli count matches nfa thompson pikevm --matches 'samwise' '(?:sam|samwise)'
build pike vm time:  29.693µs
 create cache time:  658ns
        cache size:  800
       search time:  4.221µs
            counts:  [1]

PatternID(0): [0, 3)

$ regex-cli count matches nfa thompson pikevm --matches 'samwise' '\b(?:sam|samwise)\b'
build pike vm time:  26.135µs
 create cache time:  585ns
        cache size:  896
       search time:  11.146884ms
            counts:  [1]

PatternID(0): [0, 7)

The key thing here is that you can't just try to match the inner (?:sam|samwise) without regard for the surrounding word boundaries, because the surrounding \b's might change which thing gets reported.

Here are some ideas (mostly written with a future me as a target audience, so apologies if they appear cryptic):

Normally, Aho-Corasick is used with leftmost-first or "preference order" match semantics. This way, it is purely consistent with the match semantics of the regex crate. But in this case, maybe we need to use standard match semantics and execute an overlapping search. That way, we can be sure that we're guaranteed to see every possible match. So for example, this code:

use aho_corasick::{MatchKind, AhoCorasickBuilder};

fn main() {
    let ac = AhoCorasickBuilder::new()
        .match_kind(MatchKind::Standard)
        .build(&["sam", "samwise"]);
    for m in ac.find_overlapping_iter("samwise") {
        println!("{:?}", (m.pattern(), m.start()..m.end()));
    }
}

outputs:

(0, 0..3)
(1, 0..7)

So if we iterated over those matches and then tried to test whether the surrounding word boundary assertions held and reported the first one that did, we'd get the right answer here.

Unfortunately, I don't think this works in all cases. Consider this one:

use aho_corasick::{MatchKind, AhoCorasickBuilder};

fn main() {
    let ac = AhoCorasickBuilder::new()
        .match_kind(MatchKind::Standard)
        .build(&["sam the wise", "the"]);
    for m in ac.find_overlapping_iter("sam the wise") {
        println!("{:?}", (m.pattern(), m.start()..m.end()));
    }
}

outputs:

(1, 4..7)
(0, 0..12)

So if we applied our idea from above, we'd report the as the match since it's the first one reported by Aho-Corasick where the word boundary assertions are satisfied. But this would be the wrong answer:

$ regex-cli count matches nfa thompson pikevm --matches 'sam the wise' '\b(?:sam the wise|the)\b'
build pike vm time:  32.37µs
 create cache time:  601ns
        cache size:  1136
       search time:  11.419901ms
            counts:  [1]

PatternID(0): [0, 12)

The right answer is sam the wise because of leftmost-first or "preference order" matching.

I think the reason why this case fails is because \b matches at a location other then the start or end of sam the wise. Thus, there is an opportunity for a substring in sam the wise to match even if preference should be given to sam the wise.

More generally, my intuition is that for any regex of the form \b(?:re)\b, we can match re independently and then test the word boundary assertions as long as the following holds:

We report every possible match of re, including overlapping matches.
For all strings s in the language defined by re, it must be true that \b matches at s[0], s[s.len()-1], both or neither, but nowhere else.

Probably this can be generalized even further to re-left(?:re-inner)re-right.

Jul 19 '22 13:07 BurntSushi

regex regex copied to clipboard

aho-corasick should be applied for cases like `\b(literal1|literal2|...|literalN)\b`

Discussed in https://github.com/rust-lang/regex/discussions/890

regex
regex copied to clipboard