memchr icon indicating copy to clipboard operation
memchr copied to clipboard

benchmarks: add ad hoc benchmark for 'quick-strings'

Open BurntSushi opened this issue 7 months ago • 8 comments

This is in response to @samuelcolvin's tweet about Rust's standard library substring search routines being slow. This PR hooks up his naive implementation of substring search to memchr's rebar harness and bakes them off against the substring search implementations in std and memchr across a wide variety of benchmarks.

I ran the benchmarks like so:

$ rebar build
$ rebar measure \
    -e quick-strings \
    -e rust/std/memmem/oneshot \
    -e rust/std/memmem/prebuilt \
    -e rust/memchr/memmem/oneshot \
    -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv

And here is a high level comparison using the geometric mean of relative speed ratios across all measured benchmarks (limited only to those on which all of std's, memchr's and quick-string's code is measured):

$ rebar rank --intersection /tmp/results.csv
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/prebuilt        2.7.4                     1.05                            38
rust/memchr/memmem/oneshot         2.7.4                     1.22                            38
rust/std/memmem/prebuilt           1.80.0-nightly b92758a9a  7.02                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  7.48                            38
rust/quick-strings/memmem/oneshot  0.0.1                     11.22                           38

The "oneshot" benchmarks are the closest apples-to-apples comparison. Namely, they measure the time it takes to do a search of a needle in a haystack with a single function call. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant. (For example, "find all occurrences of this needle in the lines of a file.")

For our purposes, and so we can narrow things down a bit, let's stick to "oneshot" benchmarks:

$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/oneshot         2.7.4                     1.12                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  6.87                            38
rust/quick-strings/memmem/oneshot  0.0.1                     10.31                           38

Here is a breakdown on the level of the individual benchmark:

$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes            989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space                   989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson                 1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson                 1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock                     1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock                     1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock                     1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

Even in comparison to memchr, which has been optimized like crazy, quick-strings actually does better on some benchmarks! But... if we ask for only results where there is, say, a very large difference, watch what happens:

$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

In every case, quick-strings is the one that is slower. Here's another way to slice the data. Let's only look at benchmarks with non-teeny haystacks:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

Again, quick-strings is slower in every case. Now let's look at benchmarks with only teeny haystacks (the opposite of the above):

$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark                                         rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                         --------------------------  ---------------------------------  -----------------------
memmem/subtitles/never/teeny-en-john-watson       1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes  989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes   1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space         989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson       1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson       1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock           1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock           1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock           1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

In this case, quick-strings is actually faster in every case!

My contextualization of these results is that quick-strings implements a "naive" substring search algorithm that has worst case O(m * n) time but very low startup costs. This means that if you pick your benchmarks correctly, it can appear to be a lot faster in some common cases. But once you extend beyond that or get large haystacks, perf can really tank pretty considerably.

Substring search implementations have a lot of different cases to cover. memchr in particular does try very hard to optimize for the kinds of cases where quick-strings is fast, and this is likely why it fairs a bit better than std does above on the teeny haystacks. (For example, memchr is only 1.5x slower than quick-strings on teeny-zh-sherlock-holmes, but std is 4.3x slower.)

The reason why it's difficult to get as fast as quick-strings (or pretty much any naive implementation) is that there's a bunch of startup overhead to pay for in a substring search implementation that uses multiple different strategies. For example, memchr will use SIMD in some cases, but only when the CPU supports it (via runtime querying). And memchr will change which implementation it uses based on haystack length. So there is some amount of case analysis being done before the search can even begin to execute. In contrast, quick-strings can just launch right into the search.

With that said, there's always room for improvement! memchr isn't necessarily at the floor of an otherwise irreducible trade-off, but I think that so long as you want to get high throughput for big haystacks (like what ripgrep really wants to do), you probably need to at least partially sell out some other types of tasks for very very small haystacks.

BurntSushi avatar Jul 22 '24 19:07 BurntSushi