blake2_simd: blake2-rfc is slightly faster than the portable implementation

Open oconnor663 opened this issue 5 years ago • 4 comments

https://github.com/cesarb/blake2-rfc

I measure it to be about 2% faster than portable.rs. Not yet sure why, though it might be using some SIMD under the covers, or maybe getting optimized to SSE2 by the compiler.
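A 2% gap is small enough to be swamped by run-to-run noise, so it helps to take the fastest of many runs rather than a single timing. The following is a minimal sketch of that idea using only the standard library. The names `best_of` and `placeholder_hash` are hypothetical, and `placeholder_hash` is a stand-in workload, not BLAKE2; in a real comparison it would be swapped for calls into the two crates being measured.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Run `f` over `input` `runs` times and return the fastest single run.
/// Taking the minimum filters out most scheduling and cache noise.
fn best_of<F: FnMut(&[u8]) -> u64>(mut f: F, input: &[u8], runs: usize) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box keeps the optimizer from deleting the call or its input.
        black_box(f(black_box(input)));
        best = best.min(start.elapsed());
    }
    best
}

/// Placeholder workload (NOT BLAKE2): a stand-in for a real hash call.
fn placeholder_hash(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |acc, &b| acc.rotate_left(5) ^ u64::from(b))
}

fn main() {
    let input = vec![0x42u8; 1 << 20]; // 1 MiB of input
    let best = best_of(placeholder_hash, &input, 50);
    println!("best of 50 runs: {best:?}");
}
```

Build with `--release` when timing; debug builds are not representative.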

However, the relationship is reversed if I set RUSTFLAGS="-C target-cpu=native -C target-feature=-avx2". No idea why. Again, still a small difference. Notably, both implementations tank their performance if I allow them to use AVX2.
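One way to confirm what a given RUSTFLAGS setting actually enables is to print the target features the compiler was told about. This is a minimal sketch; the function name `compile_time_features` is hypothetical, and it reports compile-time settings only, not what the CPU supports at runtime.

```rust
/// Report which x86 SIMD target features were enabled at compile time.
/// On non-x86 targets all three are simply reported as disabled.
fn compile_time_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", cfg!(target_feature = "sse2")),
        ("sse4.1", cfg!(target_feature = "sse4.1")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}

fn main() {
    for (name, enabled) in compile_time_features() {
        println!("{name}: {}", if enabled { "enabled" } else { "disabled" });
    }
}
```

Running it once with default flags and once with the RUSTFLAGS above shows whether AVX2 is really being switched off for the build.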

Nov 01 '18 04:11 oconnor663

I thought it might be because blake2-rfc was getting autovectorized, but looking at the output of cargo asm that doesn't seem to be the case. So I'm still not sure where the difference comes from.

May 24 '19 15:05 oconnor663

When I try it on ARM I get the opposite result. Should look at 32-bit ARM at some point.

Aug 08 '19 14:08 oconnor663

@oconnor663 I got the same performance (vs. blake2-rfc):

# code copied from https://github.com/shadowsocks/crypto2/tree/dev/src/hash/blake2b
git clone https://github.com/LuoZijun/test_blake2b/
cd test_blake2b
cargo bench

Aug 28 '21 12:08 LuoZijun

@oconnor663 I tried on an ARM Neoverse N1, and blake2-rfc is slightly faster there.

The code: https://github.com/gemtek-indonesia/blake2b256-bench/blob/249cac1bf8788c224f45990d607c4b510a92c862/src/main.rs#L103-L134

I compiled it with:

RUSTFLAGS="-C target-cpu=native -C codegen-units=1" cargo build --release

Jan 01 '23 14:01 Ujang360
