
Missed optimization for alternated word-lists — patterns like `\b(foo|bar|baz|quux|...)\b`

Open thomcc opened this issue 4 years ago • 6 comments

Describe your feature request

I'm unsure if I should report this as a feature request for a currently-missing "optimization", or a performance bug due to a missed optimization. The feature request template is easier to fill out, so... Here we are.

TLDR: Matching against a long list of strings is slower than it should be. Manually (or automatically) optimizing it in ways that a computer should be able to easily perform (see the giant gross regexes in code blocks) seems to have up to a ~2x performance benefit in some cases, such as the provided benchmark.


I'm working on a crate that performs syntax highlighting of source code (something like a rust equivalent to pygments).

As you may or may not imagine, it performs a lot of regex matching. One of the most common situations in this kind of thing is "match any of the words in this medium-sized list" (usually on the order of 10s of items — probably very rarely more than 100, if ever).

That is, this is trying to match a regex which conceptually looks something like `\b(foo|bar|baz|quux|...)\b`, and it's typically used for all kinds of things: keywords, builtin functions/types, well-known stuff from a language's stdlib, special constants, ...

From what I gather, this sort of regex has historically been a big pain point in terms of performance for other regex-based highlighting engines. Most of them seem to have some comments or guidance around it (often "avoid these"). Interestingly, pygments has a function that compiles an optimized regex from a list of words to match. The code for this is here: https://github.com/pygments/pygments/blob/faf69c028f0efa7798a938163e8432d838acfda2/pygments/regexopt.py
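
For illustration, here is a rough Rust sketch of the same idea (a hypothetical words_to_regex helper; it is neither pygments' actual code nor anything the regex crate provides): build a trie from the word list, then emit nested groups for the shared prefixes.

use std::collections::BTreeMap;

// Hypothetical prefix-factoring sketch: ["bar", "baz", "str", "struct"]
// comes out as roughly \b(?:ba(?:r|z)|str(?:uct)?)\b (modulo redundant
// groups). pygments additionally collapses single characters into classes
// like [rz]; that refinement is omitted here.
#[derive(Default)]
struct Trie {
    children: BTreeMap<char, Trie>,
    terminal: bool,
}

impl Trie {
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.terminal = true;
    }

    fn to_regex(&self) -> String {
        let alts: Vec<String> = self
            .children
            .iter()
            .map(|(ch, child)| format!("{}{}", regex::escape(&ch.to_string()), child.to_regex()))
            .collect();
        let body = match alts.len() {
            0 => return String::new(),
            1 => alts.into_iter().next().unwrap(),
            _ => format!("(?:{})", alts.join("|")),
        };
        if self.terminal && !self.children.is_empty() {
            // A word ends here but longer words continue,
            // e.g. str/struct becomes str(?:uct)?.
            format!("(?:{})?", body)
        } else {
            body
        }
    }
}

fn words_to_regex(words: &[&str]) -> String {
    let mut trie = Trie::default();
    for w in words {
        trie.insert(w);
    }
    // Caveat (see further down this thread): this rewrite is not
    // semantics-preserving under leftmost-first matching when one word in
    // the list is a prefix of a later one.
    format!(r"\b(?:{})\b", trie.to_regex())
}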

I had hoped that this would be unnecessary for the regex crate. Even saying something in a fit of hubris like "It'll be fine..." and reassuring myself that "some sort of aho-corasick-ey thing" will save the day, and that "the python code seems like it's an ad-hoc finite automata simplification", which I don't have to worry about because the regex crate is so good at that stuff. (A combination of: making things up to sound smart and stating things as facts because I hope that they are true).

Anyway, I wasn't going to completely handwave it, so I did a smoke test with a benchmark, and (you can probably guess the results already, since I'm here and I wrote them in the TLDR above)... It seems like it's in fact not entirely fine.

The concrete numbers on my machine were (... you can tell from the benchmark names that this didn't quite turn out how I had expected):

test naive                   ... bench:   2,177,060 ns/iter (+/- 248,300)
test quote_optimized_unquote ... bench:   1,148,483 ns/iter (+/- 350,485)

Specifically, I get a 2x speedup from using this (generated using the python pygments.regex_opt algorithm linked above):

\b(Self|a(?:bstract|s)|b(?:ecome|o(?:ol|x)|reak)|c(?:har|on(?:st|tinue)|rate)|do|e(?:lse|num|xtern)|f(?:32|64|alse|inal|n|or)|i(?:1(?:28|6)|32|64|mpl|size|[8fn])|l(?:et|oop)|m(?:a(?:cro|tch)|o(?:d|ve)|ut)|override|p(?:riv|ub)|re(?:f|turn)|s(?:elf|t(?:atic|r(?:(?:uct)?))|uper)|t(?:r(?:ait|ue|y)|ype(?:(?:of)?))|u(?:1(?:28|6)|32|64|8|ns(?:afe|ized)|s(?:(?:(?:iz)?)e))|virtual|wh(?:(?:er|il)e)|yield)\b

Compared to this (generated by putting some clothes around the result of `words.join('|')`):

\b(as|break|const|continue|crate|else|enum|extern|false|fn|for|if|impl|in|let|loop|match|mod|move|mut|pub|ref|return|self|Self|static|struct|super|trait|true|type|unsafe|use|where|while|abstract|become|box|do|final|macro|override|priv|typeof|unsized|virtual|yield|try|i8|i16|i32|i64|i128|isize|u8|u16|u32|u64|u128|usize|bool|char|str|f32|f64)\b

(These match something vaguely shaped like the list of Rust keywords and builtin types)
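
For reference, "putting some clothes around the join" in Rust might look something like this (naive_word_regex is a hypothetical helper, not part of any crate):

// Hypothetical helper: the "clothes around words.join('|')" construction,
// using regex::escape so nothing in the word list is treated as regex syntax.
fn naive_word_regex(words: &[&str]) -> regex::Regex {
    let escaped: Vec<String> = words.iter().map(|w| regex::escape(w)).collect();
    regex::Regex::new(&format!(r"\b({})\b", escaped.join("|"))).unwrap()
}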

The runnable (after a small amount of legwork) benchmark is available here: https://gist.github.com/thomcc/eeb77d34d8924444b5778d2e451caac6

(Note that the actual operation it's benching isn't really what the highlighting loop looks like, although I guess it's not entirely different either)

Anyway, now that all that rambling and background is out of the way, you can probably guess what I'm asking for, but just to be concrete:

I would love to not need to perform this kind of optimization on the input. It seems like perhaps the regex crate could perform something like it itself, or maybe something else could be done.


P.S. Psst, I uh, also wouldn't mind some pointers for a workaround, if you've got any (unless it's likely to be an easy fix, anyway). Or more generally, advice for how to match against a long list of words efficiently:

  • Should I do nothing because the numbers I saw are atypical and in practice this optimization won't generally help?
  • Should I slightly rephrase the regexes so that another optimization can kick in? (I have some flexibility here, and can plausibly avoid the leading \b with some effort, for example).
  • Should I just port that python optimization to my code, and just run it on the input word-lists before tossing them to regex?
  • Should I instead use the aho_corasick (and/or fst) crate directly, and then figure out word boundaries after-the-fact using ~~crying~~ ~~guessing~~ my wit and ingenuity? (A rough sketch of what this might look like follows this list.)
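
To make that last option concrete, here's a rough sketch (assuming the aho-corasick 1.x builder API; is_word_byte and find_words are made-up helpers, and the ASCII-only boundary check is purely illustrative):

use aho_corasick::{AhoCorasick, AhoCorasickBuilder, MatchKind};

fn is_word_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

// Find occurrences of any word, then keep only those that sit on (ASCII)
// word boundaries. LeftmostLongest is closer to what \b(...)\b does than
// the default match semantics, though not exactly equivalent.
fn find_words<'h>(ac: &AhoCorasick, haystack: &'h str) -> Vec<&'h str> {
    let bytes = haystack.as_bytes();
    ac.find_iter(haystack)
        .filter(|m| {
            let before_ok = m.start() == 0 || !is_word_byte(bytes[m.start() - 1]);
            let after_ok = m.end() == bytes.len() || !is_word_byte(bytes[m.end()]);
            before_ok && after_ok
        })
        .map(|m| &haystack[m.start()..m.end()])
        .collect()
}

fn main() {
    let words = ["fn", "let", "loop", "match"];
    let ac = AhoCorasickBuilder::new()
        .match_kind(MatchKind::LeftmostLongest)
        .build(words)
        .unwrap();
    // "match" inside "matches!" fails the boundary check and is dropped.
    assert_eq!(find_words(&ac, "let x = matches!(y, None);"), vec!["let"]);
}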

thomcc avatar Jun 10 '21 17:06 thomcc

The main problem here is, I think, the assumption that finite automata mean this problem doesn't exist. That only happens in theory, where you can play in pure-DFA land. And indeed, if you use ASCII word boundaries instead of Unicode word boundaries, the perf difference between your two benchmarks not only disappears, but both become a lot faster:

[andrew@frink issue787]$ cargo bench
   Compiling issue787 v0.1.0 (/home/andrew/tmp/regex/issue787)
    Finished bench [optimized + debuginfo] target(s) in 0.69s
     Running unittests (target/release/deps/issue787-79b0e732f8cd096d)

running 2 tests
test naive                   ... bench:   2,443,501 ns/iter (+/- 52,142)
test quote_optimized_unquote ... bench:   1,365,919 ns/iter (+/- 31,278)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 2.69s

[andrew@frink issue787]$ cargo bench
   Compiling issue787 v0.1.0 (/home/andrew/tmp/regex/issue787)
    Finished bench [optimized + debuginfo] target(s) in 0.68s
     Running unittests (target/release/deps/issue787-79b0e732f8cd096d)

running 2 tests
test naive                   ... bench:     297,995 ns/iter (+/- 37,022)
test quote_optimized_unquote ... bench:     297,612 ns/iter (+/- 17,471)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 0.55s

The latter was taken after simply adding (?-u) to the beginning of both regexes.

The issue here is the presence of the Unicode word boundary. It causes the Pike VM to be used because the lazy DFA can't handle it. (It is the only syntactic construct that the lazy DFA can't handle.) The Pike VM spends a lot of its time here shuffling the bytecode split instructions to handle the huge alternation. By applying your optimization, it spends less time doing that because more of its time will be spent inside a single (or fewer) alternations because you factored out common prefixes.

Another approach here is to remove the word boundaries altogether and the parens. This triggers a naive detection algorithm that diverts the regex to Aho-Corasick:

$ cargo bench naive
    Finished bench [optimized + debuginfo] target(s) in 0.00s
     Running unittests (target/release/deps/issue787-5d640db82910b0c1)

running 1 test
test naive                   ... bench:     836,392 ns/iter (+/- 23,497)

Regrettably, this makes things slower, because Aho-Corasick says, "oh, there aren't that many patterns, we can use Teddy on this instead of a DFA!" And for whatever reason, Teddy winds up being slower here than a good ol' DFA. (It usually isn't. This is one of the first cases I've seen where the difference is so stark.)

Arguably, the regex crate should probably factor out common prefixes inside of alternations like you've done here, at the very least, to make the Pike VM and bounded backtracker run more quickly.

BurntSushi avatar Jun 10 '21 18:06 BurntSushi

Thanks for the quick and thorough response!

And indeed, if you use ASCII word boundaries instead of Unicode word boundaries, the perf difference between your two benchmarks not only disappears, but becomes a lot faster

The latter was taken after simply adding (?-u) to the beginning of both regexes.

Ah, wonderful. Those numbers are much better. This is essentially what I was hoping for from the beginning.

Admittedly, losing Unicode there is at least a tiny bit unfortunate. The ASCII boundary isn't quite right: it will consider `\b(bool)\b` to match inside something like boolëan (which is a valid identifier in recent versions of Rust).
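
For concreteness, the difference looks like this with the regex crate (a small self-contained check):

use regex::Regex;

fn main() {
    // Unicode word boundary: 'ë' counts as a word character, so "bool"
    // inside "boolëan" is not bounded and there is no match.
    let unicode = Regex::new(r"\b(bool)\b").unwrap();
    assert!(unicode.find("boolëan").is_none());
    // ASCII word boundary via (?-u): 'ë' is not an ASCII word character,
    // so the boundary after "bool" holds and the match goes through.
    let ascii = Regex::new(r"(?-u)\b(bool)\b").unwrap();
    assert_eq!(ascii.find("boolëan").unwrap().as_str(), "bool");
}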

That said, it's hard for me to think that getting edge cases like boolëan correct is worth a ~10x perf hit... This sort of syntax highlighting is almost always going to be approximate anyway, and the case where this matters seems very unlikely to occur (and very likely to only cause minor highlighting errors in most reasonably robust grammars). So, I think I'm okay with sacrificing correctness in that case at the altar of those delicious CPU cycles.

(↑ admittedly dubious rationalization)

And for whatever reason, Teddy winds up being slower here than a good ol' DFA. (It usually isn't. This is one of the first cases I've seen where the difference is so stark.)

If nothing else, I'm glad I was able to shed light on a usage pattern which might be useful for benchmarking in the future.

thomcc avatar Jun 10 '21 18:06 thomcc

Yes, the situation with Unicode word boundaries is regrettable. I wish I had a good fix for it. The only trick that's employed is that the lazy DFA will attempt to treat a Unicode word boundary as if it were an ASCII boundary, and if it ever sees a non-ASCII byte, the lazy DFA quits and execution falls back to the Pike VM. The input you gave to the benchmark is definitely not all ASCII, so this trick didn't apply.

But yes, I think your justification for using ASCII word boundaries is reasonably solid IMO.

If nothing else, I'm glad I was able to shed light on a usage pattern which might be useful for benchmarking in the future.

Indeed. Next time I'm in that code, I'll check this test case out in more detail. It takes a lot of time to page Teddy into context. :-/ I suspect the problem is some combination of 1) high false positive rate in the prefilter, 2) validation being slow and 3) frequency of matches.

BurntSushi avatar Jun 10 '21 23:06 BurntSushi

Mostly just leaving myself a note here, but the regex optimization code you linked is not quite correct for leftmost-first or "preference order" matching. (Which is what this crate uses and also what all backtracking regex engines use.) For example, regex_opt(["sam", "samwise"]) yields (sam(?:(?:wise)?)), which is not the same as sam|samwise. (It is the same as samwise|sam though!) Demonstration:

$ python
Python 3.10.5 (main, Jun  6 2022, 18:49:26) [GCC 12.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> [m.group(0) for m in re.finditer('sam|samwise', 'samwise')]
['sam']
>>> [m.group(0) for m in re.finditer('samwise|sam', 'samwise')]
['samwise']
>>> [m.group(0) for m in re.finditer('(sam(?:(?:wise)?))', 'samwise')]
['samwise']

You'll get the same result with the regex crate.
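
For example, the equivalent check with the regex crate (as a small assert-based program):

use regex::Regex;

fn main() {
    let hay = "samwise";
    // Leftmost-first ("preference order"): the earlier branch wins.
    let sam_first = Regex::new("sam|samwise").unwrap();
    assert_eq!(sam_first.find(hay).unwrap().as_str(), "sam");
    let samwise_first = Regex::new("samwise|sam").unwrap();
    assert_eq!(samwise_first.find(hay).unwrap().as_str(), "samwise");
    // The regex_opt-style rewrite behaves like samwise|sam, not sam|samwise.
    let factored = Regex::new("(sam(?:(?:wise)?))").unwrap();
    assert_eq!(factored.find(hay).unwrap().as_str(), "samwise");
}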

So either pygments is using a regex engine where this optimization is correct, or none of their use cases trigger this bug, or they specifically work around it somehow, or pygments has a bug somewhere.

I write this because I was thinking about how they did this without fouling things up, and admiring the simplicity of the regex_opt code. But the answer is that they don't account for this. The giveaway is that the set of strings is sorted before doing anything, thereby throwing away preference order.

I think this optimization is still doable somehow, but it is trickier than it looks. Alternatively, we could only apply it when the result wouldn't alter match semantics. I think the rule for that would be: for any two strings A and B in the set where A is a prefix of B, A must appear after B in the set. Otherwise, the optimization cannot be applied.

I'm not 100% certain that rule is correct, but it's what popped out of my head.
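
Transcribed directly, that rule might look like this (a hypothetical factoring_preserves_preference_order check, carrying the same uncertainty as the rule itself):

// True if the prefix-factoring rewrite would preserve leftmost-first
// semantics under the rule above: whenever A is a proper prefix of B,
// A must appear after B in the alternation.
fn factoring_preserves_preference_order(words: &[&str]) -> bool {
    for (i, a) in words.iter().copied().enumerate() {
        for &b in &words[i + 1..] {
            // `a` comes earlier and is a proper prefix of `b`: factoring
            // them into a(?:suffix)? prefers the longer match, which flips
            // the preference order.
            if b.starts_with(a) && b.len() > a.len() {
                return false;
            }
        }
    }
    true
}

Under that check, ["samwise", "sam"] passes while ["sam", "samwise"] does not.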

BurntSushi avatar Jul 08 '22 17:07 BurntSushi

See also: #891

BurntSushi avatar Jul 19 '22 17:07 BurntSushi

OK, so I think I'll be able to call this fixed once #656 lands. In particular, I've come up with a way to optimize alternations of literals that effectively moves it closer to a DFA as opposed to a naive NFA via removing most epsilon transitions. The fundamental problem with large alternations is that they get compiled into an NFA where a single state has an epsilon transition to each of the branches in the alternation. So every time you hit that NFA state, you wind up having to follow the epsilon transitions and "try" every branch. (A lot like a backtracker...) A DFA effectively removes this by getting rid of all epsilon transitions and is thus much faster.

If your alternation is just a bunch of literals, one might imagine just shoving them into a trie and then compiling that down to an NFA. That works nicely because a trie is a DFA, and so something like bar|baz gets translated to ba(?:[rz]). That won't matter much in a specific case like that, but when the alternation is large, it turns out to matter quite a lot because it gets rid of that singular NFA state with an epsilon transition for every branch in the alternation. It's like getting a clog out.

The problem is that this crate (and all backtrackers) use leftmost-first or "preference order" for matching. As I described above, for example, sam|samwise can never actually match samwise because sam will always take priority. But that reasoning doesn't compose. For example, in \b(sam|samwise)\b, it is indeed possible for samwise to match because sam might not fall on a word boundary. So you also can't just drop literals from the alternation that have an earlier branch that is a prefix of itself.
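
Concretely (checking with the regex crate):

use regex::Regex;

fn main() {
    // In a bare alternation, the samwise branch can never win.
    let bare = Regex::new("sam|samwise").unwrap();
    assert_eq!(bare.find("samwise").unwrap().as_str(), "sam");
    // With word boundaries it can: the sam branch fails the trailing \b
    // (the 'w' after it is a word character), so samwise matches.
    let bounded = Regex::new(r"\b(sam|samwise)\b").unwrap();
    assert_eq!(bounded.find("samwise").unwrap().as_str(), "samwise");
}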

Instead, what I came up with was to treat empty strings in the alternation as "barriers" that split up the alternation into chunks of branches. Branches within each chunk can be re-ordered freely, but you can't move a branch from one chunk to another chunk. It's a little hard to explain, but here's a before-and-after of what an NFA looks like for the regex samwise|foo|sam. Here's the before:

$ regex-cli-2022-09-21-before-literal-trie debug nfa thompson 'samwise|foo|sam'
>000000: binary-union(2, 1)
 000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 13
 000003: s => 4
 000004: a => 5
 000005: m => 6
 000006: w => 7
 000007: i => 8
 000008: s => 9
 000009: e => 17
 000010: f => 11
 000011: o => 12
 000012: o => 17
 000013: union(3, 10, 14)
 000014: s => 15
 000015: a => 16
 000016: m => 17
 000017: capture(pid=0, group=0, slot=1) => 18
 000018: MATCH(0)

And now the after:

$ regex-cli debug nfa thompson 'samwise|foo|sam'
>000000: binary-union(2, 1)
 000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 12
 000003: o => 13
 000004: o => 3
 000005: binary-union(9, 13)
 000006: e => 13
 000007: s => 6
 000008: i => 7
 000009: w => 8
 000010: m => 5
 000011: a => 10
 000012: sparse(f => 4, s => 11)
 000013: capture(pid=0, group=0, slot=1) => 14
 000014: MATCH(0)

You can see that in the former case, there is a union in state 13. That's the "clog" (on a small scale of course) that gets removed in the latter case.

Once the clog is removed, this leads to substantial gains:

$ regex-cli bench diff 2022-09-22-pre-literal-trie.csv 2022-09-22-post-literal-trie.csv -f 'dictionary/(english|i787)' -e 'hybrid|pikevm'
benchmark                           engine                 2022-09-22-pre-literal-trie  2022-09-22-post-literal-trie
---------                           ------                 ---------------------------  ----------------------------
count/dictionary/english            regex/automata/hybrid  40.7 MB/s                    40.8 MB/s
count/dictionary/english            regex/automata/pikevm  -                            13.2 MB/s
count/dictionary/english-tiny       regex/automata/hybrid  81.0 MB/s                    80.1 MB/s
count/dictionary/english-tiny       regex/automata/pikevm  360 B/s                      17.7 MB/s
count/dictionary/english-10         regex/automata/hybrid  1930 B/s                     247.8 MB/s
count/dictionary/english-10         regex/automata/pikevm  -                            11.2 MB/s
count/dictionary/english-15         regex/automata/hybrid  108.6 KB/s                   519.1 MB/s
count/dictionary/english-15         regex/automata/pikevm  15.5 KB/s                    13.8 MB/s
count/dictionary/i787-word-ascii    regex/automata/pikevm  5.1 MB/s                     28.8 MB/s
count/dictionary/i787-word-unicode  regex/automata/pikevm  4.9 MB/s                     25.3 MB/s
count/dictionary/i787-noword        regex/automata/pikevm  1414.0 KB/s                  22.1 MB/s
count/dictionary/i787-noword        regex/automata/hybrid  263.1 MB/s                   267.2 MB/s

Note the count/dictionary/english-10 and count/dictionary/english-15 benchmarks in particular. These benchmarks push the example you gave in this issue to the limit by restricting the English dictionary to words with 10 or more (and 15 or more, respectively) characters. Both the PikeVM and the lazy DFA (regex/automata/hybrid) are absurdly slow before this optimization and become quite reasonable after it. The PikeVM is still overall "slow," but at least it's serviceable. In some cases, the PikeVM ran so slowly that I couldn't realistically capture benchmark data for it.

The i787 benchmarks are the benchmarks discussed in this issue. There's about a 5x speed up here for the PikeVM in particular. I also added an "opt" version of the benchmark, which is meant to test whether rewriting the alternation results in better or worse performance:

$ regex-cli bench cmp results.csv
benchmark                               pcre2/api/jit  pcre2/api/nojit  regex/api  regex/automata/pikevm
---------                               -------------  ---------------  ---------  ---------------------
count/dictionary/i787-opt-word-unicode  109.1 MB/s     1481.6 KB/s      95.7 MB/s  11.0 MB/s
count/dictionary/i787-word-unicode      88.3 MB/s      1400.4 KB/s      54.7 MB/s  25.2 MB/s

So in this case, the rewritten "optimized" variant does worse than the "unoptimized" variant. The trie optimization doesn't apply to the "optimized" variant because it is not a simple alternation of literals. There is nesting happening that inhibits the somewhat simplified analysis I added. So I guess it turns out that the automatic trie optimization ends up doing better here. Indeed, looking at the compiled NFA, the "optimized" variant has several largish union states, whereas the trie optimization on the unoptimized variant has nothing more than binary-union. In other words, the "optimized" variant still has a clog, but one that is smaller than the clog in the "unoptimized" variant without the trie optimization.

The main downside of this optimization is that it's easily broken. Even something like samwise|f[oa]o|sam will break it, because it's no longer a plain alternation of literals. Enabling case insensitivity will break it too.

The "next level" version of this optimization is basically going to be a matter of selectively/smartly converting parts of an NFA into a DFA via epsilon removal. But I think we need a richer NFA data structure (like a full graph) to enable those.

BurntSushi avatar Sep 22 '22 15:09 BurntSushi