memchr
Poor performance on Zen 1/Threadripper due to loop unrolling
I have noticed that the SSE2 implementation of memchr in this crate unrolls its main loop 4x. Unfortunately, this seems to lead to a significant performance drop on Zen 1 processors. Benchmarked on a Threadripper 1950X, I see about 50-60% better performance than this crate when avoiding loop unrolling altogether. Below is a benchmark demonstrating this. The memchr implementation in the repository linked below is written so that you only have to change one constant (UNROLL_SIZE) to change the amount of loop unrolling used in the main loop.
https://github.com/redzic/memchr-demonstration
Just clone the repository and run cargo run --release to run the benchmark.
Increasing UNROLL_SIZE leads to worse performance on my TR 1950X, with 4x unrolling being basically the same speed as this crate, which makes sense. However, with UNROLL_SIZE = 8, the gap between this crate's implementation and the custom implementation spikes again: the custom implementation ends up about 10% faster (i.e., this crate's SSE2 memchr runs at roughly 90.9% of its speed).
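For illustration, here is roughly what the unroll factor controls in an SSE2 memchr main loop. This is a simplified sketch, not this crate's actual code; names like UNROLL and the block-rescan strategy are just for exposition.

```rust
// Simplified sketch of an unrolled SSE2 memchr main loop; not this crate's
// actual implementation. UNROLL is the constant under discussion (4 here).
#[cfg(target_arch = "x86_64")]
unsafe fn memchr_sse2_sketch(needle: u8, haystack: &[u8]) -> Option<usize> {
    use core::arch::x86_64::*;

    const VECTOR: usize = 16; // bytes per SSE2 register
    const UNROLL: usize = 4;  // vectors compared per main-loop iteration

    let vn = _mm_set1_epi8(needle as i8);
    let ptr = haystack.as_ptr();
    let len = haystack.len();
    let mut i = 0;

    // Main loop: compare UNROLL * 16 bytes per iteration, OR the comparison
    // results together, and only rescan the block if any byte matched.
    while i + UNROLL * VECTOR <= len {
        let mut any = _mm_setzero_si128();
        for j in 0..UNROLL {
            let chunk = _mm_loadu_si128(ptr.add(i + j * VECTOR) as *const __m128i);
            any = _mm_or_si128(any, _mm_cmpeq_epi8(chunk, vn));
        }
        if _mm_movemask_epi8(any) != 0 {
            // A match is somewhere in this block; find it one vector at a time.
            for j in 0..UNROLL {
                let chunk = _mm_loadu_si128(ptr.add(i + j * VECTOR) as *const __m128i);
                let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, vn));
                if mask != 0 {
                    return Some(i + j * VECTOR + mask.trailing_zeros() as usize);
                }
            }
        }
        i += UNROLL * VECTOR;
    }

    // Scalar tail for the last len % (UNROLL * VECTOR) bytes.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```

Changing UNROLL here is the knob the benchmark repository exposes as UNROLL_SIZE.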
Would it be possible to tune the unroll factor, or possibly even do something similar to OpenBLAS, where we query information about the CPU, such as cache size or even the exact CPU model, and dispatch code accordingly? Perhaps this functionality could be implemented behind some kind of feature flag.
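To make the idea concrete, model-based dispatch could look something like the sketch below. This is purely illustrative: the family check and the 2-vs-4 unroll choice are assumptions for exposition, not anything this crate provides.

```rust
// Purely illustrative sketch of CPU-model dispatch; nothing like this exists
// in the crate today, and the "preferred unroll" values are made up.
#[cfg(target_arch = "x86_64")]
fn is_amd_family_17h() -> bool {
    use core::arch::x86_64::__cpuid;
    // SAFETY: CPUID is available on all x86_64 CPUs.
    let v = unsafe { __cpuid(0) };
    let vendor_is_amd = v.ebx == u32::from_le_bytes(*b"Auth")
        && v.edx == u32::from_le_bytes(*b"enti")
        && v.ecx == u32::from_le_bytes(*b"cAMD");
    if !vendor_is_amd {
        return false;
    }
    let eax = unsafe { __cpuid(1) }.eax;
    let base_family = (eax >> 8) & 0xF;
    let ext_family = (eax >> 20) & 0xFF;
    let family = if base_family == 0xF { base_family + ext_family } else { base_family };
    // Family 0x17 covers Zen 1/Zen+/Zen 2 (a TR 1950X reports this family).
    family == 0x17
}

#[cfg(target_arch = "x86_64")]
fn preferred_unroll() -> usize {
    // Hypothetical tuning table; the 2-vs-4 choice is only an example.
    if is_amd_family_17h() { 2 } else { 4 }
}
```

In practice the result would presumably be computed once and cached (e.g., behind a function pointer swapped in on first call), similar to how runtime AVX2 detection is typically handled.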
Up front, I'll say that I don't have a lot of bandwidth to dedicate to this. And in order to do any kind of change like this, I would probably insist on a careful analysis. I would also want to know why glibc hasn't done something like this. IIRC, their implementation (in Assembly) does a similar amount of unrolling.
In principle I'm open to something like this, but it seems like a great complication. Although, if a 50% hit is at stake here, then perhaps a great complication is an acceptable cost.
With all that said, your benchmark looks like it's bunk to me. Or at least, not particularly representative and something that would be unlikely to benefit much from loop unrolling. Namely, it looks like you're only benchmarking on random data with a random needle. You're also using different random data for each of the memchr implementations instead of the same data. Truly random data is no doubt a valid use case, but in my experience it is not altogether common. You're more likely to be testing latency here instead of throughput, and the loop unrolling is primarily targeting throughput. So it seems to me like you've probably constructed a benchmark that is, at least in part, going to magnify the impact of changes to loop unrolling in a way that prioritizes latency over throughput. No doubt a valid benchmark, but very incomplete.
IMO, the next step here is to plug your alternative implementation into memchr's considerably comprehensive benchmark harness and report the results there.
Sorry for the late response on this, I have been busy with a lot of stuff. From my simple benchmarking, it seems like you're right; there are definitely performance benefits in avoiding loop unrolling for smaller arrays, but for cases like a huge array with no matches, I actually saw something like a 2x performance degradation compared to the unrolled loop version.
I think there are some possible improvements to the Memchr struct's iterator as well; specifically, I think it's worth trying to store a cached match mask so as to avoid reading the same memory multiple times when there are multiple matches in the same area. Given that this iterator is how the benchmarks are performed, I think I might try implementing this caching mechanism to see if it brings any gains, especially since that might affect the relative performance between the two memchr implementations.
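To sketch what I mean by caching (a simplified scalar stand-in, not the crate's actual iterator): after scanning a block once, keep the resulting match bitmask and pop one set bit per next() call instead of re-reading the same bytes.

```rust
// Illustrative sketch of the caching idea. A real implementation would fill
// the mask with _mm_cmpeq_epi8/_mm_movemask_epi8; a scalar loop keeps the
// sketch portable.
struct CachedMatches<'a> {
    haystack: &'a [u8],
    needle: u8,
    block_start: usize, // offset of the block the cached mask refers to
    mask: u32,          // one bit per byte in that block that matched
    pos: usize,         // next offset to scan once the mask is exhausted
}

impl<'a> CachedMatches<'a> {
    fn new(needle: u8, haystack: &'a [u8]) -> Self {
        CachedMatches { haystack, needle, block_start: 0, mask: 0, pos: 0 }
    }
}

impl<'a> Iterator for CachedMatches<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        loop {
            if self.mask != 0 {
                // Pop the lowest set bit: one match, no memory access needed.
                let bit = self.mask.trailing_zeros() as usize;
                self.mask &= self.mask - 1;
                return Some(self.block_start + bit);
            }
            // Refill the mask from the next 16-byte block.
            if self.pos >= self.haystack.len() {
                return None;
            }
            let end = (self.pos + 16).min(self.haystack.len());
            let mut mask = 0u32;
            for (i, &b) in self.haystack[self.pos..end].iter().enumerate() {
                if b == self.needle {
                    mask |= 1 << i;
                }
            }
            self.block_start = self.pos;
            self.mask = mask;
            self.pos = end;
        }
    }
}
```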
I'll try to continue working on this when I have time, but it will probably take me a while since I am also balancing quite a lot of things. Perhaps there are some other ways to optimize the code that could also be looked into. I've found that even for "simple" things like summing 32-bit integers in an array, a manual assembly implementation has a lot of room for improvement: for example, reading past the end of the array to avoid a residual loop (which seems OK in practice, though I'm not sure how acceptable that is for this crate; I saw a note about it in this crate), or jumping to a label for the remaining vector operations based on a lookup table (which as far as I know is not really possible in C or Rust, or at least is very difficult if it is possible at all).
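For context, the residual-loop shape that the read-past-the-end trick avoids looks roughly like this in safe Rust (the chunk width of 8 elements is just illustrative):

```rust
// The standard "residual loop" pattern: process full fixed-width chunks,
// then fall back to a scalar loop for the last len % WIDTH elements.
fn sum_u32(xs: &[u32]) -> u32 {
    const WIDTH: usize = 8; // illustrative vector width in elements
    let mut chunks = xs.chunks_exact(WIDTH);
    let mut acc = [0u32; WIDTH];
    for chunk in &mut chunks {
        for i in 0..WIDTH {
            // This inner loop is the part the compiler can turn into SIMD adds.
            acc[i] = acc[i].wrapping_add(chunk[i]);
        }
    }
    // Residual loop: handle the leftover elements one at a time.
    let mut total: u32 = acc.iter().fold(0, |a, &b| a.wrapping_add(b));
    for &x in chunks.remainder() {
        total = total.wrapping_add(x);
    }
    total
}
```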
> reading past the end of the array to avoid a residual loop (which seems OK in practice, though I'm not sure how acceptable that is for this crate; I saw a note about it in this crate)
Yeah it sounds like UB. If you can build a compelling argument that it's not and convince the unsafe WG that you're correct, then I'm fine with it.
My understanding is that tricks like this require dropping down into assembly. Definitely prefer to keep the implementations in Rust. Eventually I would like to move to the portable SIMD API in std once that stabilizes.
> jumping to a label for the remaining vector operations based on a lookup table (which as far as I know is not really possible in C or Rust, or at least is very difficult if it is possible at all)
Hmmm yeah, not sure how to do that outside of assembly. In theory there should be a code pattern in Rust that lets us compile down to something like that, but in practice I wouldn't know where to start with that.
Otherwise, looking at some improvements here would be great. Your ideas about iteration sound plausible.
I do generally prefer to optimize for throughput, but I'm not opposed to improvements in latency if the cost to throughput is small.
I don't think there are any specific or actionable next steps here based on the comments above.