std.mem.reverse: Improve performance
A while ago, while optimizing a program, I noticed that `std.mem.reverse` wasn't getting optimized to use SIMD, so I decided to write my own faster implementation. After some refinement, this is my best effort at making it as efficient and readable as possible. For benchmarks see this repo. Any feedback is very welcome.
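For reference, here is a minimal sketch of the general approach (not necessarily this PR's exact code): reverse in fixed-size blocks from both ends, so the per-block work has a comptime-known length that the optimizer can turn into vector shuffles. The block size of 16 is an arbitrary assumption.

```zig
const std = @import("std");

// Sketch only: swap fixed-size blocks from the two ends of the slice.
// Because `block` is comptime-known, the inner copies and reversals can
// be unrolled and vectorized; the remaining middle is handled scalar.
fn reverseBlocks(comptime T: type, items: []T) void {
    const block = 16; // assumed block size, would need tuning per T
    var i: usize = 0;
    var j: usize = items.len;
    while (j - i >= 2 * block) {
        j -= block;
        var left: [block]T = items[i..][0..block].*;
        var right: [block]T = items[j..][0..block].*;
        std.mem.reverse(T, &left); // comptime-known length, vectorizable
        std.mem.reverse(T, &right);
        items[i..][0..block].* = right;
        items[j..][0..block].* = left;
        i += block;
    }
    std.mem.reverse(T, items[i..j]); // scalar tail, fewer than 2*block items
}
```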
I am quite certain that `std.mem.reverse` will get SIMD optimized. It would use something like `vperm*`.
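As a concrete illustration of what that codegen looks like (my own example, not code from this PR): reversing a fixed-width vector in Zig is a single `@shuffle` with a reversed index mask, which on AVX2 typically lowers to one `vpshufb`/`vperm*`-style permutation.

```zig
// Illustrative only: reverse a 32-byte vector with a comptime shuffle mask.
fn reverseVec(v: @Vector(32, u8)) @Vector(32, u8) {
    const mask = comptime blk: {
        var m: [32]i32 = undefined;
        for (&m, 0..) |*e, i| e.* = 31 - @as(i32, @intCast(i));
        break :blk m;
    };
    // Second operand is unused since the mask only indexes into `v`.
    return @shuffle(u8, v, undefined, mask);
}
```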
- Could you please post benchmarks on this PR that showcase how it's faster than the current implementation?
- You're building in `ReleaseSmall` in the benchmark: https://github.com/JonathanHallstrom/array_reversing/blob/main/benchmarking/build.sh#L12
If you'd like I can send the benchmarks directly here, or you can see them at this link https://github.com/JonathanHallstrom/array_reversing/tree/main/bench_results. They are a little out of date, but I will try to update them ASAP.
I was unable to get the current implementation to output SIMD instructions (using `vpshuf*` or `vperm*` like you suggest). Code generation diffs for a few input sizes here.
Edit: that godbolt link is pretty cluttered; here's a slightly cleaned up version.
> If you'd like I can send the benchmarks directly here, or you can see them at this link https://github.com/JonathanHallstrom/array_reversing/tree/main/bench_results. They are a little out of date, but I will try to update them ASAP.
It would be nice to have them here, thanks.
> I was unable to get the current implementation to output SIMD instructions (using `vpshuf*` or `vperm*` like you suggest). Code generation diffs for a few input sizes here.
Here's a godbolt I quickly mocked up: https://zig.godbolt.org/z/3rr6joTa7
If you build with `ReleaseSmall`, it's very rare that you will get SIMD instructions, as they are quite large.
You're quite right that constant-size buffers get optimized; I'm relying on that in my implementation. I should have clarified that I meant I'm not seeing any SIMD output for slices with runtime-known sizes.
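To make the distinction concrete, here's a minimal godbolt-style pair (my own example, not from the PR): the fixed-size version vectorizes, while the slice version stays a scalar swap loop.

```zig
const std = @import("std");

// Comptime-known length: LLVM can fully unroll and emit vector shuffles.
pub fn reverseFixed(buf: *[64]u8) void {
    std.mem.reverse(u8, buf);
}

// Runtime-known length: currently compiles to a scalar swap loop.
pub fn reverseRuntime(buf: []u8) void {
    std.mem.reverse(u8, buf);
}
```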
Here are some of the benchmark results from my machine.
`memset` is included to give a rough estimate of the memory bandwidth limit.
Note that these benchmarks are not ideal, as they're generated with the Google Benchmark library, so the actual benchmarking code is not written in Zig.
Edit: I did this because it's the library I'm familiar with, and I was unable to find a comparable Zig library.
> You're quite right that constant-size buffers get optimized; I'm relying on that in my implementation. I should have clarified that I meant I'm not seeing any SIMD output for slices with runtime-known sizes.
That makes sense.
> Note that these benchmarks are not ideal, as they're generated with the Google Benchmark library, so the actual benchmarking code is not written in Zig.
>
> Edit: I did this because it's the library I'm familiar with, and I was unable to find a comparable Zig library.
It doesn't really need to be a library. Just running it a thousand times and using a cycle counter with a `clflush` in between runs would be just fine.
~~Just noticed that my change seems to make the constant-size array case worse.~~ Accidentally used the wrong version of the code. Here's the up-to-date version; no change for constant-size arrays as far as I can tell.
> ~~Just noticed that my change seems to make the constant-size array case worse.~~ Accidentally used the wrong version of the code. Here's the up-to-date version; no change for constant-size arrays as far as I can tell.
That does look better yeah. I'm OK with this optimization.
I'm rerunning some of the benchmarks now so they'll be up to date.
> It doesn't really need to be a library. Just running it a thousand times and using a cycle counter with a `clflush` in between runs would be just fine.
How would I do a `clflush`? Do I need inline asm?
> > It doesn't really need to be a library. Just running it a thousand times and using a cycle counter with a `clflush` in between runs would be just fine.
>
> How would I do a `clflush`? Do I need inline asm?
I've adapted my usual benchmark script to this. https://gist.github.com/Rexicon226/b533e0f1ec317b873cff691f54e63364
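For anyone reading along, the flush-then-time pattern looks roughly like this on x86_64 (a hedged sketch with illustrative names, not the gist's actual code; and yes, `clflush` needs inline asm in Zig):

```zig
const std = @import("std");

// Illustrative x86_64-only sketch: evict a buffer from cache, then time
// one call with the TSC. Real harnesses also serialize with cpuid/lfence.
inline fn clflush(ptr: *const anyopaque) void {
    asm volatile ("clflush (%[p])"
        :
        : [p] "r" (ptr),
        : "memory"
    );
}

inline fn rdtsc() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtsc"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi),
    );
    return (@as(u64, hi) << 32) | lo;
}

fn flushBuffer(buf: []const u8) void {
    var i: usize = 0;
    while (i < buf.len) : (i += 64) clflush(&buf[i]); // assume 64B cache lines
}

fn measureReverse(buf: []u8) u64 {
    var best: u64 = std.math.maxInt(u64);
    for (0..1000) |_| {
        flushBuffer(buf); // start each run from a cold cache
        const start = rdtsc();
        std.mem.reverse(u8, buf);
        best = @min(best, rdtsc() - start);
    }
    return best;
}
```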
It looks good to me.
I don't understand why the macOS release CI failed, did it fail to run?
> I don't understand why the macOS release CI failed, did it fail to run?
#20473
Note that this will also make `mem.rotate` faster, since it's just 3 reverses.
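For reference, the three-reversal identity it relies on (a minimal sketch of the classic algorithm, not the stdlib's exact code): reversing each half and then the whole slice is a left rotation.

```zig
const std = @import("std");

// Rotate `items` left by `amount` using three reversals.
fn rotateByReversal(comptime T: type, items: []T, amount: usize) void {
    std.mem.reverse(T, items[0..amount]);
    std.mem.reverse(T, items[amount..]);
    std.mem.reverse(T, items);
}

test "rotate left by 2" {
    var data = [_]u8{ 1, 2, 3, 4, 5 };
    rotateByReversal(u8, &data, 2);
    try std.testing.expectEqualSlices(u8, &.{ 3, 4, 5, 1, 2 }, &data);
}
```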
Nice, thanks!