std.mem.reverse: Improve performance
A while ago, while optimizing a program, I noticed that `std.mem.reverse` wasn't getting optimized to use SIMD, so I decided to write my own faster implementation. After some refinement, this is my best effort at making it as efficient and readable as possible. For benchmarks see this repo. Any feedback is very welcome.
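For reference, here is a minimal sketch of the general approach (not necessarily this PR's exact code): reverse in fixed-size blocks from both ends, so the per-block work has a comptime-known length that the optimizer can turn into vector shuffles. The block size of 16 is an arbitrary assumption.

```zig
const std = @import("std");

// Sketch only: swap fixed-size blocks from the two ends of the slice.
// Because `block` is comptime-known, the inner copies and reversals can
// be unrolled and vectorized; the remaining middle is handled scalar.
fn reverseBlocks(comptime T: type, items: []T) void {
    const block = 16; // assumed block size, would need tuning per T
    var i: usize = 0;
    var j: usize = items.len;
    while (j - i >= 2 * block) {
        j -= block;
        var left: [block]T = items[i..][0..block].*;
        var right: [block]T = items[j..][0..block].*;
        std.mem.reverse(T, &left); // comptime-known length, vectorizable
        std.mem.reverse(T, &right);
        items[i..][0..block].* = right;
        items[j..][0..block].* = left;
        i += block;
    }
    std.mem.reverse(T, items[i..j]); // scalar tail, fewer than 2*block items
}
```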
I am quite certain that `std.mem.reverse` will get SIMD optimized. It would use something like `vperm*`.
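As a concrete illustration of what that codegen looks like (my own example, not code from this PR): reversing a fixed-width vector in Zig is a single `@shuffle` with a reversed index mask, which on AVX2 typically lowers to one `vpshufb`/`vperm*`-style permutation.

```zig
// Illustrative only: reverse a 32-byte vector with a comptime shuffle mask.
fn reverseVec(v: @Vector(32, u8)) @Vector(32, u8) {
    const mask = comptime blk: {
        var m: [32]i32 = undefined;
        for (&m, 0..) |*e, i| e.* = 31 - @as(i32, @intCast(i));
        break :blk m;
    };
    // Second operand is unused since the mask only indexes into `v`.
    return @shuffle(u8, v, undefined, mask);
}
```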
- Could you please post benchmarks on this PR that showcase how it's faster than the current implementation?
- You're building in `ReleaseSmall` in the benchmark: https://github.com/JonathanHallstrom/array_reversing/blob/main/benchmarking/build.sh#L12
If you'd like I can send the benchmarks directly here, or you can see them at this link https://github.com/JonathanHallstrom/array_reversing/tree/main/bench_results. They are a little out of date, but I will try to update them ASAP.
I was unable to get the current implementation to output SIMD instructions (using `vpshuf*` or `vperm*` like you suggest). Code generation diffs for a few input sizes here.
Edit: that godbolt link is pretty cluttered; here's a slightly cleaned up version.
> If you'd like I can send the benchmarks directly here, or you can see them at this link https://github.com/JonathanHallstrom/array_reversing/tree/main/bench_results. They are a little out of date, but I will try to update them ASAP.
It would be nice to have them here, thanks.
> I was unable to get the current implementation to output SIMD instructions (using `vpshuf*` or `vperm*` like you suggest). Code generation diffs for a few input sizes here.
Here's a godbolt I quickly mocked up: https://zig.godbolt.org/z/3rr6joTa7
If you build with `ReleaseSmall`, it's very rare that you will get SIMD instructions, as they are quite large.
You're quite right that constant-size buffers get optimized; I'm relying on that in my implementation. I should have clarified that I meant I'm not seeing any SIMD output for slices with runtime-known sizes.
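To make the distinction concrete, here's a minimal godbolt-style pair (my own example, not from the PR): the fixed-size version vectorizes, while the slice version stays a scalar swap loop.

```zig
const std = @import("std");

// Comptime-known length: LLVM can fully unroll and emit vector shuffles.
pub fn reverseFixed(buf: *[64]u8) void {
    std.mem.reverse(u8, buf);
}

// Runtime-known length: currently compiles to a scalar swap loop.
pub fn reverseRuntime(buf: []u8) void {
    std.mem.reverse(u8, buf);
}
```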
Here are some of the benchmark results from my machine.
`memset` is included to give a rough estimate of the memory bandwidth limit.
Note that these benchmarks are not ideal, as they're generated with the Google Benchmark library, so the actual benchmarking code is not written in Zig.
Edit: I did this because it's the library I'm familiar with, and I was unable to find a comparable Zig library.
> You're quite right that constant-size buffers get optimized; I'm relying on that in my implementation. I should have clarified that I meant I'm not seeing any SIMD output for slices with runtime-known sizes.
That makes sense.
> Note that these benchmarks are not ideal, as they're generated with the Google Benchmark library, so the actual benchmarking code is not written in Zig.
>
> Edit: I did this because it's the library I'm familiar with, and I was unable to find a comparable Zig library.
It doesn't really need to be a library. Just running it a thousand times and using a cycle counter with a `clflush` in between runs would be just fine.
~~Just noticed that my change seems to make the constant-size array case worse.~~ Accidentally used the wrong version of the code. Here's the up-to-date version; no change for constant-size arrays as far as I can tell.
> ~~Just noticed that my change seems to make the constant-size array case worse.~~ Accidentally used the wrong version of the code. Here's the up-to-date version; no change for constant-size arrays as far as I can tell.
That does look better yeah. I'm OK with this optimization.
I'm rerunning some of the benchmarks now so they'll be up to date.
> It doesn't really need to be a library. Just running it a thousand times and using a cycle counter with a `clflush` in between runs would be just fine.
How would I do a `clflush`? Do I need inline asm?
> > It doesn't really need to be a library. Just running it a thousand times and using a cycle counter with a `clflush` in between runs would be just fine.
>
> How would I do a `clflush`? Do I need inline asm?
I've adapted my usual benchmark script to this. https://gist.github.com/Rexicon226/b533e0f1ec317b873cff691f54e63364
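For anyone reading along, the flush-then-time pattern looks roughly like this on x86_64 (a hedged sketch with illustrative names, not the gist's actual code; and yes, `clflush` needs inline asm in Zig):

```zig
const std = @import("std");

// Illustrative x86_64-only sketch: evict a buffer from cache, then time
// one call with the TSC. Real harnesses also serialize with cpuid/lfence.
inline fn clflush(ptr: *const anyopaque) void {
    asm volatile ("clflush (%[p])"
        :
        : [p] "r" (ptr),
        : "memory"
    );
}

inline fn rdtsc() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtsc"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi),
    );
    return (@as(u64, hi) << 32) | lo;
}

fn flushBuffer(buf: []const u8) void {
    var i: usize = 0;
    while (i < buf.len) : (i += 64) clflush(&buf[i]); // assume 64B cache lines
}

fn measureReverse(buf: []u8) u64 {
    var best: u64 = std.math.maxInt(u64);
    for (0..1000) |_| {
        flushBuffer(buf); // start each run from a cold cache
        const start = rdtsc();
        std.mem.reverse(u8, buf);
        best = @min(best, rdtsc() - start);
    }
    return best;
}
```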
It looks good to me.
I don't understand why the macOS release CI failed, did it fail to run?
> I don't understand why the macOS release CI failed, did it fail to run?
#20473
Note that this will also make `mem.rotate` faster, since it's just 3 reverses.
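For reference, the three-reversal identity it relies on (a minimal sketch of the classic algorithm, not the stdlib's exact code): reversing each half and then the whole slice is a left rotation.

```zig
const std = @import("std");

// Rotate `items` left by `amount` using three reversals.
fn rotateByReversal(comptime T: type, items: []T, amount: usize) void {
    std.mem.reverse(T, items[0..amount]);
    std.mem.reverse(T, items[amount..]);
    std.mem.reverse(T, items);
}

test "rotate left by 2" {
    var data = [_]u8{ 1, 2, 3, 4, 5 };
    rotateByReversal(u8, &data, 2);
    try std.testing.expectEqualSlices(u8, &.{ 3, 4, 5, 1, 2 }, &data);
}
```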
Nice, thanks!