memchr icon indicating copy to clipboard operation
memchr copied to clipboard

benchmarks: add ad hoc benchmark for 'quick-strings'

Open BurntSushi opened this issue 1 year ago • 8 comments

This is in response to @samuelcolvin's tweet about Rust's standard library substring search routines being slow. This PR hooks up his naive implementation of substring search to memchr's rebar harness and bakes them off against the substring search implementations in std and memchr across a wide variety of benchmarks.

I ran the benchmarks like so:

$ rebar build
$ rebar measure \
    -e quick-strings \
    -e rust/std/memmem/oneshot \
    -e rust/std/memmem/prebuilt \
    -e rust/memchr/memmem/oneshot \
    -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv

And here is a high level comparison using the geometric mean of relative speed ratios across all measured benchmarks (limited only to those on which all of std's, memchr's and quick-string's code is measured):

$ rebar rank --intersection /tmp/results.csv
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/prebuilt        2.7.4                     1.05                            38
rust/memchr/memmem/oneshot         2.7.4                     1.22                            38
rust/std/memmem/prebuilt           1.80.0-nightly b92758a9a  7.02                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  7.48                            38
rust/quick-strings/memmem/oneshot  0.0.1                     11.22                           38

The "oneshot" benchmarks are the closest apples-to-apples comparison. Namely, they measure the time it takes to do a search of a needle in a haystack with a single function call. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant. (For example, "find all occurrences of this needle in the lines of a file.")

For our purposes, and so we can narrow things down a bit, let's stick to "oneshot" benchmarks:

$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/oneshot         2.7.4                     1.12                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  6.87                            38
rust/quick-strings/memmem/oneshot  0.0.1                     10.31                           38

Here is a breakdown on the level of the individual benchmark:

$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes            989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space                   989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson                 1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson                 1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock                     1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock                     1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock                     1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

Even in comparison to memchr, which has been optimized like crazy, quick-strings actually does better on some benchmarks! But... if we ask for only results where there is, say, a very large difference, watch what happens:

$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

In every case, quick-strings is the one that is slower. Here's another way to slice the data. Let's only look at benchmarks with non-teeny haystacks:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

Again, quick-strings is slower in every case. Now let's look at benchmarks with only teeny haystacks (the opposite of the above):

$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark                                         rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                         --------------------------  ---------------------------------  -----------------------
memmem/subtitles/never/teeny-en-john-watson       1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes  989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes   1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space         989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson       1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson       1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock           1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock           1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock           1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

In this case, quick-strings is actually faster in every case!

My contextualization of these results is that quick-strings implements a "naive" substring search algorithm that has worst case O(m * n) time but very low startup costs. This means that if you pick your benchmarks correctly, it can appear to be a lot faster in some common cases. But once you extend beyond that or get large haystacks, perf can really tank pretty considerably.

Substring search implementations have a lot of different cases to cover. memchr in particular does try very hard to optimize for the kinds of cases where quick-strings is fast, and this is likely why it fairs a bit better than std does above on the teeny haystacks. (For example, memchr is only 1.5x slower than quick-strings on teeny-zh-sherlock-holmes, but std is 4.3x slower.)

The reason why it's difficult to get as fast as quick-strings (or pretty much any naive implementation) is that there's a bunch of startup overhead to pay for in a substring search implementation that uses multiple different strategies. For example, memchr will use SIMD in some cases, but only when the CPU supports it (via runtime querying). And memchr will change which implementation it uses based on haystack length. So there is some amount of case analysis being done before the search can even begin to execute. In contrast, quick-strings can just launch right into the search.

With that said, there's always room for improvement! memchr isn't necessarily at the floor of an otherwise irreducible trade-off, but I think that so long as you want to get high throughput for big haystacks (like what ripgrep really wants to do), you probably need to at least partially sell out some other types of tasks for very very small haystacks.

BurntSushi avatar Jul 22 '24 19:07 BurntSushi

This is fascinating, thank you so much.

I'll look now soon.

Perhaps we can just use memchr for arrow-rs.

It's worth noting that wrt LIKE and ILIKE queries:

  1. Most worries have quite a short needle
  2. 99.x% of rows will often be misses, so the cost of a match doesn't matter much.

samuelcolvin avatar Jul 22 '24 21:07 samuelcolvin

memchr would be a good bet. If you're doing LIKE-style queries, then you should be able to build a memchr::memmem::Finder once and reuse it. And if you do that, it does much better on the short haystacks cases, even when comparing with quick-strings:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e quick-strings -e memchr/memmem/prebuilt
benchmark                                                   rust/memchr/memmem/prebuilt  rust/quick-strings/memmem/oneshot
---------                                                   ---------------------------  ---------------------------------
memmem/code/rust-library-never-fn-strength                  49.0 GB/s (1.00x)            1620.4 MB/s (30.94x)
memmem/code/rust-library-never-fn-strength-paren            48.9 GB/s (1.00x)            1637.2 MB/s (30.59x)
memmem/code/rust-library-never-fn-quux                      36.5 GB/s (1.00x)            1637.2 MB/s (22.86x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)            5.8 GB/s (8.40x)
memmem/pathological/md5-huge-no-hash                        40.3 GB/s (1.00x)            1190.6 MB/s (34.63x)
memmem/pathological/md5-huge-last-hash                      47.3 GB/s (1.00x)            1196.0 MB/s (40.49x)
memmem/pathological/rare-repeated-huge-tricky               51.3 GB/s (1.00x)            1984.3 MB/s (26.47x)
memmem/pathological/rare-repeated-small-tricky              27.4 GB/s (1.00x)            1936.4 MB/s (14.50x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)             1069.8 MB/s (5.18x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.00x)            1167.3 MB/s (16.87x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.9 MB/s (1.00x)           22.9 MB/s (43.38x)
memmem/subtitles/never/huge-en-john-watson                  40.2 GB/s (1.00x)            1930.5 MB/s (21.34x)
memmem/subtitles/never/huge-en-all-common-bytes             45.6 GB/s (1.00x)            1377.7 MB/s (33.91x)
memmem/subtitles/never/huge-en-some-rare-bytes              49.2 GB/s (1.00x)            1969.1 MB/s (25.59x)
memmem/subtitles/never/huge-en-two-space                    41.6 GB/s (1.00x)            727.8 MB/s (58.53x)
memmem/subtitles/never/teeny-en-john-watson                 1668.9 MB/s (1.00x)          1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-all-common-bytes            1668.9 MB/s (1.00x)          1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1668.9 MB/s (1.00x)          1335.1 MB/s (1.25x)
memmem/subtitles/never/teeny-en-two-space                   1668.9 MB/s (1.00x)          1112.6 MB/s (1.50x)
memmem/subtitles/never/huge-ru-john-watson                  40.3 GB/s (1.00x)            475.6 MB/s (86.68x)
memmem/subtitles/never/teeny-ru-john-watson                 2.6 GB/s (1.00x)             1668.9 MB/s (1.60x)
memmem/subtitles/never/huge-zh-john-watson                  49.8 GB/s (1.00x)            1457.1 MB/s (35.00x)
memmem/subtitles/never/teeny-zh-john-watson                 1847.7 MB/s (1.00x)          1642.4 MB/s (1.12x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.4 GB/s (1.00x)            1907.4 MB/s (27.58x)
memmem/subtitles/rare/huge-en-sherlock                      50.8 GB/s (1.00x)            1907.0 MB/s (27.26x)
memmem/subtitles/rare/huge-en-medium-needle                 36.9 GB/s (1.00x)            1358.2 MB/s (27.84x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)            1715.0 MB/s (25.82x)
memmem/subtitles/rare/huge-en-huge-needle                   39.3 GB/s (1.00x)            1915.9 MB/s (21.03x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              1570.8 MB/s (1.00x)          1335.1 MB/s (1.18x)
memmem/subtitles/rare/teeny-en-sherlock                     1335.1 MB/s (1.11x)          1483.5 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               50.3 GB/s (1.00x)            479.5 MB/s (107.49x)
memmem/subtitles/rare/huge-ru-sherlock                      51.2 GB/s (1.00x)            479.5 MB/s (109.32x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              2.1 GB/s (1.00x)             1430.5 MB/s (1.47x)
memmem/subtitles/rare/teeny-ru-sherlock                     1741.5 MB/s (1.00x)          1668.9 MB/s (1.04x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.1 GB/s (1.00x)            1080.8 MB/s (46.50x)
memmem/subtitles/rare/huge-zh-sherlock                      50.0 GB/s (1.00x)            1075.2 MB/s (47.60x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1137.1 MB/s (1.30x)          1478.2 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1231.8 MB/s (1.50x)          1847.7 MB/s (1.00x)

std doesn't expose any sort of analogous API for a prebuilt searcher based on just the needle.

That said, memchr doesn't do case insensitive search. That's a different can of worms, especially if you care about Unicode support. A regex::Regex supports it, but there's a lot more machinery there and its overhead is even bigger than memchr::memmem. If you only need ASCII case insensitivity, then there is AhoCorasickBuilder::ascii_case_insensitive, although I don't believe that's had a ton of optimization work put into it, so YMMV.

BurntSushi avatar Jul 22 '24 22:07 BurntSushi

Thanks @BurntSushi,

That said, memchr doesn't do case insensitive search.

Here arrow-rs already falls back to using regex::Regex, the idea for quick-strings is for the very common cases where a LIKE or ILIKE query can be represented as starts_with, ends_with, contains, istarts_with, iends_with or icontains.

For that reason, we only need ascii case insensitive methods — if the needle is case insensitive, we use that methods, otherwise we can just fallback to using regex::Regex.

Ascii needles make up (in my experience) a very large proportion of queries, hence the value in optimising that case.

Is there any interest in adding ascii only case-insensitive methods to memchr? I would really like to use just memchr and not need to use any mof the logic from quick-strings.

samuelcolvin avatar Jul 23 '24 20:07 samuelcolvin

cc @alamb who I think might be interested in this discussion.

samuelcolvin avatar Jul 23 '24 20:07 samuelcolvin

The "search in ASCII case insensitive mode" is something that I think would probably have a good home in this crate, but it is a big scope increase. That is, it's something that I do think would be nice to support, but it's a tall order. Of course, an initial implementation could be naive (as long as it's not naive algorithmically, i.e., optimal worst case behavior) I suppose. But there are likely SIMD tricks specific to ASCII case insensitive search that we'll want to employ. That's probably easily ~1 month of my free coding time in the evenings.

But yeah, your strategy makes sense. I suspect what you're fighting is actually just regex overhead itself. The regex crate is, quite literally, doing exactly the optimization you're trying to do manually. The actual regex engine won't be used at all for simple ASCII-only literal queries. It will just defer to this crate. But there are abstraction layers it has to cut through at search time... Maybe there is a way to improve those, but I suspect there will always be cases like this where it will be faster for the caller to do this sort of optimization on their own unfortunately. If you were searching larger haystacks, then you wouldn't have to worry about this at all.

BurntSushi avatar Jul 23 '24 21:07 BurntSushi

An approach we have taken elsewhere in arrow-rs is to literally special case the different implementations

So maybe we could do something like this:

if haystack.len() < 1024 { // or whatever the constant happens to be
  // use small-strings implementation
} else {
  // use more general purpose std or memchr etc
}

This would of course result in more code / bigger binary

alamb avatar Jul 23 '24 21:07 alamb

. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant.

I think this is an excellent observation @BurntSushi

@samuelcolvin maybe this is a good hint that we could make the arrow-rs substring kernels behave more like the regex kernes and "compile / prepare" the matcher

Given the kernels typically operate on batches of 4000 or 8000ish rows we can probably amortize the "prepare the needle" step well

alamb avatar Jul 23 '24 21:07 alamb

See https://github.com/apache/arrow-rs/pull/6118, https://github.com/apache/arrow-rs/pull/6128 and https://github.com/apache/arrow-rs/pull/6131 for implementation in arrow-rs.

samuelcolvin avatar Jul 26 '24 12:07 samuelcolvin