memchr
memchr copied to clipboard
benchmarks: add ad hoc benchmark for 'quick-strings'
This is in response to @samuelcolvin's tweet about Rust's standard
library substring search routines being slow. This PR hooks up his naive
implementation of substring search to memchr's rebar harness and bakes
them off against the substring search implementations in std and
memchr across a wide variety of benchmarks.
I ran the benchmarks like so:
$ rebar build
$ rebar measure \
-e quick-strings \
-e rust/std/memmem/oneshot \
-e rust/std/memmem/prebuilt \
-e rust/memchr/memmem/oneshot \
-e rust/memchr/memmem/prebuilt | tee /tmp/results.csv
And here is a high level comparison using the geometric mean of relative speed ratios across all measured benchmarks (limited only to those on which all of std's, memchr's and quick-string's code is measured):
$ rebar rank --intersection /tmp/results.csv
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memmem/prebuilt 2.7.4 1.05 38
rust/memchr/memmem/oneshot 2.7.4 1.22 38
rust/std/memmem/prebuilt 1.80.0-nightly b92758a9a 7.02 38
rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 7.48 38
rust/quick-strings/memmem/oneshot 0.0.1 11.22 38
The "oneshot" benchmarks are the closest apples-to-apples comparison. Namely, they measure the time it takes to do a search of a needle in a haystack with a single function call. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant. (For example, "find all occurrences of this needle in the lines of a file.")
For our purposes, and so we can narrow things down a bit, let's stick to "oneshot" benchmarks:
$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memmem/oneshot 2.7.4 1.12 38
rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 6.87 38
rust/quick-strings/memmem/oneshot 0.0.1 10.31 38
Here is a breakdown on the level of the individual benchmark:
$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x)
Even in comparison to memchr, which has been optimized like crazy,
quick-strings actually does better on some benchmarks! But... if we
ask for only results where there is, say, a very large difference, watch
what happens:
$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x)
In every case, quick-strings is the one that is slower. Here's another
way to slice the data. Let's only look at benchmarks with non-teeny
haystacks:
[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x)
Again, quick-strings is slower in every case. Now let's look at
benchmarks with only teeny haystacks (the opposite of the above):
$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x)
In this case, quick-strings is actually faster in every case!
My contextualization of these results is that quick-strings
implements a "naive" substring search algorithm that has worst case
O(m * n) time but very low startup costs. This means that if you
pick your benchmarks correctly, it can appear to be a lot faster
in some common cases. But once you extend beyond that or get large
haystacks, perf can really tank pretty considerably.
Substring search implementations have a lot of different cases to
cover. memchr in particular does try very hard to optimize for the
kinds of cases where quick-strings is fast, and this is likely why
it fairs a bit better than std does above on the teeny haystacks.
(For example, memchr is only 1.5x slower than quick-strings on
teeny-zh-sherlock-holmes, but std is 4.3x slower.)
The reason why it's difficult to get as fast as quick-strings (or
pretty much any naive implementation) is that there's a bunch of startup
overhead to pay for in a substring search implementation that uses
multiple different strategies. For example, memchr will use SIMD in
some cases, but only when the CPU supports it (via runtime querying).
And memchr will change which implementation it uses based on haystack
length. So there is some amount of case analysis being done before the
search can even begin to execute. In contrast, quick-strings can just
launch right into the search.
With that said, there's always room for improvement! memchr isn't
necessarily at the floor of an otherwise irreducible trade-off, but I
think that so long as you want to get high throughput for big haystacks
(like what ripgrep really wants to do), you probably need to at least
partially sell out some other types of tasks for very very small
haystacks.
This is fascinating, thank you so much.
I'll look now soon.
Perhaps we can just use memchr for arrow-rs.
It's worth noting that wrt LIKE and ILIKE queries:
- Most worries have quite a short needle
- 99.x% of rows will often be misses, so the cost of a match doesn't matter much.
memchr would be a good bet. If you're doing LIKE-style queries, then you should be able to build a memchr::memmem::Finder once and reuse it. And if you do that, it does much better on the short haystacks cases, even when comparing with quick-strings:
[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e quick-strings -e memchr/memmem/prebuilt
benchmark rust/memchr/memmem/prebuilt rust/quick-strings/memmem/oneshot
--------- --------------------------- ---------------------------------
memmem/code/rust-library-never-fn-strength 49.0 GB/s (1.00x) 1620.4 MB/s (30.94x)
memmem/code/rust-library-never-fn-strength-paren 48.9 GB/s (1.00x) 1637.2 MB/s (30.59x)
memmem/code/rust-library-never-fn-quux 36.5 GB/s (1.00x) 1637.2 MB/s (22.86x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x)
memmem/pathological/md5-huge-no-hash 40.3 GB/s (1.00x) 1190.6 MB/s (34.63x)
memmem/pathological/md5-huge-last-hash 47.3 GB/s (1.00x) 1196.0 MB/s (40.49x)
memmem/pathological/rare-repeated-huge-tricky 51.3 GB/s (1.00x) 1984.3 MB/s (26.47x)
memmem/pathological/rare-repeated-small-tricky 27.4 GB/s (1.00x) 1936.4 MB/s (14.50x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.00x) 1167.3 MB/s (16.87x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.9 MB/s (1.00x) 22.9 MB/s (43.38x)
memmem/subtitles/never/huge-en-john-watson 40.2 GB/s (1.00x) 1930.5 MB/s (21.34x)
memmem/subtitles/never/huge-en-all-common-bytes 45.6 GB/s (1.00x) 1377.7 MB/s (33.91x)
memmem/subtitles/never/huge-en-some-rare-bytes 49.2 GB/s (1.00x) 1969.1 MB/s (25.59x)
memmem/subtitles/never/huge-en-two-space 41.6 GB/s (1.00x) 727.8 MB/s (58.53x)
memmem/subtitles/never/teeny-en-john-watson 1668.9 MB/s (1.00x) 1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-all-common-bytes 1668.9 MB/s (1.00x) 1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-some-rare-bytes 1668.9 MB/s (1.00x) 1335.1 MB/s (1.25x)
memmem/subtitles/never/teeny-en-two-space 1668.9 MB/s (1.00x) 1112.6 MB/s (1.50x)
memmem/subtitles/never/huge-ru-john-watson 40.3 GB/s (1.00x) 475.6 MB/s (86.68x)
memmem/subtitles/never/teeny-ru-john-watson 2.6 GB/s (1.00x) 1668.9 MB/s (1.60x)
memmem/subtitles/never/huge-zh-john-watson 49.8 GB/s (1.00x) 1457.1 MB/s (35.00x)
memmem/subtitles/never/teeny-zh-john-watson 1847.7 MB/s (1.00x) 1642.4 MB/s (1.12x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.4 GB/s (1.00x) 1907.4 MB/s (27.58x)
memmem/subtitles/rare/huge-en-sherlock 50.8 GB/s (1.00x) 1907.0 MB/s (27.26x)
memmem/subtitles/rare/huge-en-medium-needle 36.9 GB/s (1.00x) 1358.2 MB/s (27.84x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.82x)
memmem/subtitles/rare/huge-en-huge-needle 39.3 GB/s (1.00x) 1915.9 MB/s (21.03x)
memmem/subtitles/rare/teeny-en-sherlock-holmes 1570.8 MB/s (1.00x) 1335.1 MB/s (1.18x)
memmem/subtitles/rare/teeny-en-sherlock 1335.1 MB/s (1.11x) 1483.5 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 50.3 GB/s (1.00x) 479.5 MB/s (107.49x)
memmem/subtitles/rare/huge-ru-sherlock 51.2 GB/s (1.00x) 479.5 MB/s (109.32x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes 2.1 GB/s (1.00x) 1430.5 MB/s (1.47x)
memmem/subtitles/rare/teeny-ru-sherlock 1741.5 MB/s (1.00x) 1668.9 MB/s (1.04x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.1 GB/s (1.00x) 1080.8 MB/s (46.50x)
memmem/subtitles/rare/huge-zh-sherlock 50.0 GB/s (1.00x) 1075.2 MB/s (47.60x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes 1137.1 MB/s (1.30x) 1478.2 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock 1231.8 MB/s (1.50x) 1847.7 MB/s (1.00x)
std doesn't expose any sort of analogous API for a prebuilt searcher based on just the needle.
That said, memchr doesn't do case insensitive search. That's a different can of worms, especially if you care about Unicode support. A regex::Regex supports it, but there's a lot more machinery there and its overhead is even bigger than memchr::memmem. If you only need ASCII case insensitivity, then there is AhoCorasickBuilder::ascii_case_insensitive, although I don't believe that's had a ton of optimization work put into it, so YMMV.
Thanks @BurntSushi,
That said, memchr doesn't do case insensitive search.
Here arrow-rs already falls back to using regex::Regex, the idea for quick-strings is for the very common cases where a LIKE or ILIKE query can be represented as starts_with, ends_with, contains, istarts_with, iends_with or icontains.
For that reason, we only need ascii case insensitive methods — if the needle is case insensitive, we use that methods, otherwise we can just fallback to using regex::Regex.
Ascii needles make up (in my experience) a very large proportion of queries, hence the value in optimising that case.
Is there any interest in adding ascii only case-insensitive methods to memchr? I would really like to use just memchr and not need to use any mof the logic from quick-strings.
cc @alamb who I think might be interested in this discussion.
The "search in ASCII case insensitive mode" is something that I think would probably have a good home in this crate, but it is a big scope increase. That is, it's something that I do think would be nice to support, but it's a tall order. Of course, an initial implementation could be naive (as long as it's not naive algorithmically, i.e., optimal worst case behavior) I suppose. But there are likely SIMD tricks specific to ASCII case insensitive search that we'll want to employ. That's probably easily ~1 month of my free coding time in the evenings.
But yeah, your strategy makes sense. I suspect what you're fighting is actually just regex overhead itself. The regex crate is, quite literally, doing exactly the optimization you're trying to do manually. The actual regex engine won't be used at all for simple ASCII-only literal queries. It will just defer to this crate. But there are abstraction layers it has to cut through at search time... Maybe there is a way to improve those, but I suspect there will always be cases like this where it will be faster for the caller to do this sort of optimization on their own unfortunately. If you were searching larger haystacks, then you wouldn't have to worry about this at all.
An approach we have taken elsewhere in arrow-rs is to literally special case the different implementations
So maybe we could do something like this:
if haystack.len() < 1024 { // or whatever the constant happens to be
// use small-strings implementation
} else {
// use more general purpose std or memchr etc
}
This would of course result in more code / bigger binary
. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant.
I think this is an excellent observation @BurntSushi
@samuelcolvin maybe this is a good hint that we could make the arrow-rs substring kernels behave more like the regex kernes and "compile / prepare" the matcher
Given the kernels typically operate on batches of 4000 or 8000ish rows we can probably amortize the "prepare the needle" step well
See https://github.com/apache/arrow-rs/pull/6118, https://github.com/apache/arrow-rs/pull/6128 and https://github.com/apache/arrow-rs/pull/6131 for implementation in arrow-rs.