memchr icon indicating copy to clipboard operation
memchr copied to clipboard

Port the memchr implementation to "genericsimd"

Open alexcrichton opened this issue 3 years ago • 2 comments

This commit ports the various memchr algorithms implemented for SSE2 at src/memchr/x86/sse2.rs to the Vector trait. This added a few minor methods to the Vector trait and the rewrite is hopefully quite straightforward.

Afterwards the src/memchr/x86/avx.rs module is deleted because the theory is that it's simply the AVX2-using version of the new generic code (basically a new type parameter). Further changes were made to all the memchr-related functions to also enable WebAssembly acceleration with the v128 type. Ideally there's only one place necessary for new architectures to specify their implementation as well.

This method of abstraction is distinct from the memmem module, however. I've tried to draw inspiration from the previous design of the memchr module but I believe that the memmem module is newer and so further refactorings may be necessary to align the two in style.

alexcrichton avatar Dec 23 '21 00:12 alexcrichton

@BurntSushi ok this is my old branch dusted off. I wanted to open this up to start CI testing and figure out what's broken. I haven't had a chance to benchmark this thoroughly yet, especially with the deletion of the avx2 module assuming that the __m256i-generic implementation is the same. I'll want to do that before actually landing this.

Otherwise though I figured it might be good to post something that can be poked aroun.

alexcrichton avatar Dec 23 '21 00:12 alexcrichton

Ok the PR as-is has these numbers. To generate these numbers I ran:

$ cargo bench -- 'mem.*chr.*krate' --save-baseline $name

in the bench directory. For generating the reports I ran:

$ critcmp -t 5 generic master

to hopefully cut down on the noise

AVX2 implementation

These are generated with the PR as-is since the machine I'm running on run-time-detected AVX2:

Screen Shot 2021-12-25 at 12 09 17 AM
raw data ``` group generic master ----- ------- ------ memchr1/krate/empty/never 1.74 0.9±0.00ns 0 B/sec 1.00 0.5±0.00ns 0 B/sec memchr1/krate/huge/never 1.15 15.1±0.12µs 36.7 GB/sec 1.00 13.1±0.11µs 42.3 GB/sec memchr1/krate/huge/uncommon 1.00 110.4±0.12µs 5.0 GB/sec 1.06 117.0±0.55µs 4.7 GB/sec memchr1/krate/small/never 1.35 17.4±0.20ns 35.6 GB/sec 1.00 12.9±0.19ns 48.0 GB/sec memchr1/krate/small/rare 1.18 24.8±0.07ns 24.9 GB/sec 1.00 21.1±0.22ns 29.3 GB/sec memchr1/krate/small/uncommon 1.06 64.4±0.30ns 9.6 GB/sec 1.00 61.1±0.56ns 10.1 GB/sec memchr1/krate/tiny/never 1.27 9.3±0.08ns 6.9 GB/sec 1.00 7.3±0.03ns 8.8 GB/sec memchr1/krate/tiny/rare 1.15 13.5±0.05ns 4.8 GB/sec 1.00 11.7±0.06ns 5.5 GB/sec memchr1/krate/tiny/uncommon 1.42 47.0±0.13ns 1398.6 MB/sec 1.00 33.2±0.11ns 1981.4 MB/sec memchr2/krate/empty/never 1.41 1.0±0.01ns 0 B/sec 1.00 0.7±0.00ns 0 B/sec memchr2/krate/huge/rare 1.00 24.8±0.12µs 22.3 GB/sec 1.08 26.7±0.23µs 20.7 GB/sec memchr2/krate/huge/uncommon 1.00 225.4±1.35µs 2.5 GB/sec 1.05 236.9±2.13µs 2.3 GB/sec memchr2/krate/small/never 1.00 24.7±0.23ns 25.1 GB/sec 1.20 29.6±1.31ns 20.9 GB/sec memchr2/krate/small/rare 1.00 41.2±0.22ns 15.0 GB/sec 1.29 53.0±0.55ns 11.7 GB/sec memchr2/krate/small/uncommon 1.00 150.8±0.43ns 4.1 GB/sec 1.22 183.4±1.04ns 3.4 GB/sec memchr2/krate/tiny/never 1.00 8.8±0.04ns 7.3 GB/sec 1.35 11.9±0.09ns 5.4 GB/sec memchr3/krate/empty/never 1.52 1.0±0.01ns 0 B/sec 1.00 0.6±0.01ns 0 B/sec memchr3/krate/huge/never 1.22 30.8±0.12µs 18.0 GB/sec 1.00 25.3±0.20µs 21.9 GB/sec memchr3/krate/huge/rare 1.18 37.4±0.14µs 14.8 GB/sec 1.00 31.7±0.23µs 17.5 GB/sec memchr3/krate/huge/uncommon 1.00 268.1±0.35µs 2.1 GB/sec 1.07 285.7±0.52µs 1986.0 MB/sec memchr3/krate/small/never 1.39 41.1±0.36ns 15.1 GB/sec 1.00 29.6±0.08ns 20.9 GB/sec memchr3/krate/small/rare 1.11 60.5±0.13ns 10.2 GB/sec 1.00 54.6±0.24ns 11.3 GB/sec memchr3/krate/tiny/never 1.00 9.8±0.35ns 6.5 GB/sec 1.16 11.4±0.04ns 5.7 GB/sec memrchr1/krate/empty/never 1.71 1.8±0.05ns 0 B/sec 1.00 1.1±0.00ns 0 B/sec memrchr1/krate/huge/never 1.14 14.4±0.57µs 38.4 GB/sec 1.00 12.7±0.05µs 43.7 GB/sec memrchr3/krate/small/common 1.20 925.5±4.70ns 684.2 MB/sec 1.00 772.8±1.44ns 819.4 MB/sec memrchr3/krate/small/rare 1.10 57.8±0.19ns 10.7 GB/sec 1.00 52.4±0.31ns 11.8 GB/sec memrchr3/krate/small/uncommon 1.23 230.7±0.98ns 2.7 GB/sec 1.00 186.9±1.31ns 3.3 GB/sec memrchr3/krate/tiny/never 1.05 11.4±0.10ns 5.6 GB/sec 1.00 10.8±0.04ns 5.9 GB/sec memrchr3/krate/tiny/rare 1.17 37.1±0.20ns 1773.9 MB/sec 1.00 31.6±0.19ns 2.0 GB/sec memrchr3/krate/tiny/uncommon 1.23 145.6±0.59ns 451.9 MB/sec 1.00 118.2±0.32ns 556.7 MB/sec ```

SSE2 implementation

For these numbers I edited the run-time detection to skip the avx2 feature detection, meaning that the SSE2 implementation was forced to be used

Screen Shot 2021-12-25 at 12 11 05 AM
raw data ``` group generic-no-avx master-no-avx ----- -------------- ------------- memchr1/krate/huge/never 1.10 30.6±0.08µs 18.1 GB/sec 1.00 27.8±0.42µs 19.9 GB/sec memchr1/krate/huge/rare 1.08 32.8±0.07µs 16.9 GB/sec 1.00 30.2±0.09µs 18.3 GB/sec memchr1/krate/small/never 1.07 41.1±1.29ns 15.0 GB/sec 1.00 38.6±1.18ns 16.0 GB/sec memchr1/krate/small/rare 1.07 47.6±0.17ns 13.0 GB/sec 1.00 44.6±0.58ns 13.9 GB/sec memchr3/krate/small/rare 1.12 120.6±0.94ns 5.1 GB/sec 1.00 108.2±0.39ns 5.7 GB/sec memchr3/krate/small/uncommon 1.09 256.9±1.20ns 2.4 GB/sec 1.00 235.9±2.89ns 2.6 GB/sec memchr3/krate/tiny/never 1.07 15.6±0.04ns 4.1 GB/sec 1.00 14.5±0.04ns 4.4 GB/sec memrchr1/krate/empty/never 1.05 1.9±0.03ns 0 B/sec 1.00 1.8±0.01ns 0 B/sec memrchr1/krate/tiny/never 1.00 9.4±0.04ns 6.8 GB/sec 1.12 10.5±0.06ns 6.1 GB/sec memrchr1/krate/tiny/rare 1.00 13.4±0.04ns 4.8 GB/sec 1.08 14.5±0.07ns 4.4 GB/sec memrchr2/krate/huge/common 1.00 566.1±1.09µs 1002.3 MB/sec 1.21 684.4±2.13µs 829.0 MB/sec memrchr2/krate/huge/uncommon 1.00 272.9±0.32µs 2.0 GB/sec 1.06 288.4±0.58µs 1967.3 MB/sec memrchr2/krate/small/common 1.00 462.8±2.07ns 1368.4 MB/sec 1.36 629.7±2.91ns 1005.6 MB/sec memrchr2/krate/small/uncommon 1.00 167.7±0.95ns 3.7 GB/sec 1.08 181.4±0.69ns 3.4 GB/sec memrchr2/krate/tiny/never 1.09 13.0±0.22ns 4.9 GB/sec 1.00 11.9±0.04ns 5.4 GB/sec memrchr2/krate/tiny/rare 1.17 36.7±0.32ns 1793.7 MB/sec 1.00 31.2±0.17ns 2.1 GB/sec memrchr2/krate/tiny/uncommon 1.00 85.5±0.39ns 769.2 MB/sec 1.11 95.3±0.35ns 690.5 MB/sec memrchr3/krate/huge/never 1.00 52.5±0.17µs 10.6 GB/sec 1.08 56.6±0.21µs 9.8 GB/sec memrchr3/krate/huge/rare 1.00 62.9±1.26µs 8.8 GB/sec 1.06 66.5±0.92µs 8.3 GB/sec ```

I'm not super certain as to how much variation these benchmarks have, but that's at least the current state! My goal was basically parity with the master branch as-is, but I may need to dig into some of the cases here because the speedups/slowdowns are bigger than I'd expect.

alexcrichton avatar Dec 25 '21 06:12 alexcrichton

Thanks so much for working on this! I've finally gotten around to taking a look at this PR. I'm also trying to knock out #76 at the same time. Unfortunately, the neon ISA doesn't have a movemask routine, which means that if I wind up going with a generic version, it's going to look a bit different.

I'm also having a little trouble with the runtime feature detection macros. I'm also having trouble with all the conditional compilation in memmem as well (where I also plan to add aarch64 support). Long story short, I think I'm just going to try and think more holistically about how to manage all the complexity here and try to find a way to push down and encapsulate the condition compilation in a more sensible way than is being done today.

Thanks again for working on this and getting the wasm stuff working in memmem. My plan is to get x86, wasm and aarch64 supported in memchr as well.

BurntSushi avatar Jul 12 '23 00:07 BurntSushi