memchr
memchr copied to clipboard
benchmarks: add ad hoc benchmark for 'quick-strings'
This is in response to @samuelcolvin's tweet about Rust's standard
library substring search routines being slow. This PR hooks up his naive
implementation of substring search to memchr's rebar harness and bakes
them off against the substring search implementations in std
and
memchr
across a wide variety of benchmarks.
I ran the benchmarks like so:
$ rebar build
$ rebar measure \
-e quick-strings \
-e rust/std/memmem/oneshot \
-e rust/std/memmem/prebuilt \
-e rust/memchr/memmem/oneshot \
-e rust/memchr/memmem/prebuilt | tee /tmp/results.csv
And here is a high level comparison using the geometric mean of relative speed ratios across all measured benchmarks (limited only to those on which all of std's, memchr's and quick-string's code is measured):
$ rebar rank --intersection /tmp/results.csv
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memmem/prebuilt 2.7.4 1.05 38
rust/memchr/memmem/oneshot 2.7.4 1.22 38
rust/std/memmem/prebuilt 1.80.0-nightly b92758a9a 7.02 38
rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 7.48 38
rust/quick-strings/memmem/oneshot 0.0.1 11.22 38
The "oneshot" benchmarks are the closest apples-to-apples comparison. Namely, they measure the time it takes to do a search of a needle in a haystack with a single function call. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant. (For example, "find all occurrences of this needle in the lines of a file.")
For our purposes, and so we can narrow things down a bit, let's stick to "oneshot" benchmarks:
$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memmem/oneshot 2.7.4 1.12 38
rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 6.87 38
rust/quick-strings/memmem/oneshot 0.0.1 10.31 38
Here is a breakdown on the level of the individual benchmark:
$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x)
Even in comparison to memchr
, which has been optimized like crazy,
quick-strings
actually does better on some benchmarks! But... if we
ask for only results where there is, say, a very large difference, watch
what happens:
$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x)
In every case, quick-strings
is the one that is slower. Here's another
way to slice the data. Let's only look at benchmarks with non-teeny
haystacks:
[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x)
Again, quick-strings
is slower in every case. Now let's look at
benchmarks with only teeny haystacks (the opposite of the above):
$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot
--------- -------------------------- --------------------------------- -----------------------
memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x)
In this case, quick-strings
is actually faster in every case!
My contextualization of these results is that quick-strings
implements a "naive" substring search algorithm that has worst case
O(m * n)
time but very low startup costs. This means that if you
pick your benchmarks correctly, it can appear to be a lot faster
in some common cases. But once you extend beyond that or get large
haystacks, perf can really tank pretty considerably.
Substring search implementations have a lot of different cases to
cover. memchr
in particular does try very hard to optimize for the
kinds of cases where quick-strings
is fast, and this is likely why
it fairs a bit better than std
does above on the teeny haystacks.
(For example, memchr
is only 1.5x slower than quick-strings
on
teeny-zh-sherlock-holmes
, but std is 4.3x slower.)
The reason why it's difficult to get as fast as quick-strings
(or
pretty much any naive implementation) is that there's a bunch of startup
overhead to pay for in a substring search implementation that uses
multiple different strategies. For example, memchr
will use SIMD in
some cases, but only when the CPU supports it (via runtime querying).
And memchr
will change which implementation it uses based on haystack
length. So there is some amount of case analysis being done before the
search can even begin to execute. In contrast, quick-strings
can just
launch right into the search.
With that said, there's always room for improvement! memchr
isn't
necessarily at the floor of an otherwise irreducible trade-off, but I
think that so long as you want to get high throughput for big haystacks
(like what ripgrep really wants to do), you probably need to at least
partially sell out some other types of tasks for very very small
haystacks.