blake2_simd
blake2-rfc is slightly faster than the portable implementation
https://github.com/cesarb/blake2-rfc
I measure it to be about 2% faster than portable.rs. Not yet sure why, though it might be using some SIMD under the covers, or maybe getting optimized to SSE2 by the compiler.
However, the relationship is reversed if I set RUSTFLAGS="-C target-cpu=native -C target-feature=-avx2". No idea why. Again, it's still a small difference. Notably, both implementations tank their performance if I allow them to use AVX2.
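One way to double-check what those RUSTFLAGS actually did is to print the features the binary was compiled with. This is only a sketch using the standard library; the helper names `compile_time_features` and `runtime_avx2` are made up for illustration and are not part of either crate. `cfg!(target_feature = ...)` reflects what the compiler was allowed to assume at build time, while `is_x86_feature_detected!` asks the CPU at runtime.

```rust
/// Hypothetical helper: which features the compiler was allowed to use for this build.
fn compile_time_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", cfg!(target_feature = "sse2")),
        ("sse4.1", cfg!(target_feature = "sse4.1")),
        ("avx2", cfg!(target_feature = "avx2")),
        ("neon", cfg!(target_feature = "neon")),
    ]
}

/// Hypothetical helper: runtime AVX2 detection, only meaningful on x86/x86_64.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn runtime_avx2() -> Option<bool> {
    Some(std::is_x86_feature_detected!("avx2"))
}

/// On other architectures (e.g. ARM) there is nothing to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn runtime_avx2() -> Option<bool> {
    None
}

fn main() {
    for (name, enabled) in compile_time_features() {
        println!("{name}: enabled at compile time = {enabled}");
    }
    match runtime_avx2() {
        Some(found) => println!("avx2: detected at runtime = {found}"),
        None => println!("avx2: not an x86 target"),
    }
}
```

Building this with and without the flags above should show whether AVX2 is really being switched off at compile time, which helps separate compiler codegen effects from runtime dispatch.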
I thought it might be because blake2-rfc was getting autovectorized, but looking at the output of cargo asm that doesn't seem to be the case. So I'm still not sure where the difference comes from.
When I try it on ARM I get the opposite result. Should look at 32-bit ARM at some point.
@oconnor663 I got the same performance (vs blake2-rfc):
# code copied from https://github.com/shadowsocks/crypto2/tree/dev/src/hash/blake2b
git clone https://github.com/LuoZijun/test_blake2b/
cd test_blake2b
cargo bench
@oconnor663 I tried with ARM Neoverse N1; blake2-rfc is slightly faster.
The code: https://github.com/gemtek-indonesia/blake2b256-bench/blob/249cac1bf8788c224f45990d607c4b510a92c862/src/main.rs#L103-L134
And compiled it with:
RUSTFLAGS="-C target-cpu=native -C codegen-units=1" cargo build --release
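For anyone who wants a quick cross-check without a benchmark framework, a rough throughput measurement could look like the sketch below. It is not the code from the linked repository. `placeholder_hash` is an FNV-1a style stand-in (not BLAKE2) so that the example stays dependency-free, and `throughput_mb_per_s` is a hypothetical helper; swap in the hash call under test.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Placeholder workload (FNV-1a style), NOT BLAKE2: replace with the hash being measured.
fn placeholder_hash(input: &[u8]) -> u64 {
    input.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x100000001b3)
    })
}

/// Hypothetical helper: runs `f` over `input` `iters` times, returns throughput in MB/s.
fn throughput_mb_per_s<F: Fn(&[u8]) -> u64>(f: F, input: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from removing the "unused" hash call.
        black_box(f(black_box(input)));
    }
    let secs = start.elapsed().as_secs_f64();
    (input.len() as f64 * iters as f64) / secs / 1e6
}

fn main() {
    let input = vec![0u8; 1 << 20]; // 1 MiB of zeros
    let rate = throughput_mb_per_s(placeholder_hash, &input, 100);
    println!("{rate:.1} MB/s");
}
```

With differences as small as the ones reported in this thread, a single run of a loop like this is too noisy to settle anything; repeated runs on an otherwise idle machine are the minimum.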