Multiversioned AVX-512 functions lower into SSE intrinsics via `wide`
Profile: https://share.firefox.dev/4otLxuT
You can clearly see that `<wide::f32x16_::f32x16 as core::ops::arith::Mul>::mul` eventually lowers into `core::core_arch::x86::sse::_mm_mul_ps`.
With `RUSTFLAGS=-C target-cpu=native` on Zen 4 it lowers into `core::core_arch::x86::avx512f::_mm512_mul_ps` instead.
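For reference, the dispatch pattern multiversioning is supposed to produce can be sketched with nothing but `std::arch`. This is an illustrative sketch, not code from any of the crates involved; `fma4` and `fma4_dispatch` are made-up names. The key point is that intrinsics only lower to the wide instructions when the enclosing function is compiled with the feature enabled, which is exactly what a crate built against the baseline target misses:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Compiled with FMA enabled, so _mm_fmadd_ps lowers to an actual vfmadd
// instruction. Code in a dependency built without the feature stays at
// the baseline SSE2, no matter what the caller enables.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn fma4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        let vc = _mm_loadu_ps(c.as_ptr());
        let r = _mm_fmadd_ps(va, vb, vc);
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), r);
        out
    }
}

// Runtime dispatch: pick the FMA path when the CPU supports it,
// otherwise fall back to scalar mul_add.
fn fma4_dispatch(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("fma") {
            // SAFETY: FMA support was just verified at runtime.
            return unsafe { fma4(a, b, c) };
        }
    }
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i].mul_add(b[i], c[i]);
    }
    out
}
```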
~~Funnily enough, it's not even a big deal for DiF, which only improves by 10% for size 524288, but DiT gains +50% performance~~ with `-C target-cpu=native` at size 524288. After #41 there's a +10% improvement for DiT and a 10% regression for DiF on Zen 4.
At size 64 there's +33% to be had in DiF and +110% in DiT.
To corroborate this issue, I've just been trying the profile example, and I can see that my AVX2 + FMA Ryzen 5625U falls back to SSE instructions. Setting `-C target-cpu=native` brings back the full expected AVX2 + FMA instructions.
We used to have everything written in terms of `std::simd` but ported it to `wide` to work on stable. We'll need to port to either `fearless_simd` or `macerator`, because `wide` turned out to be incompatible with multiversioning.
My article about the state of SIMD in 2025 was just me writing down the research I did for this issue.
Are you interested in optimisation suggestions in the meantime, or is that wasted effort given the code needs a full port anyway? I was just about to click submit on a new issue regarding the use of `as_chunks` instead of `chunks_exact`. The conversion into `f32x8` in the `wide` SIMD kernels spends a lot of time calling `try_into().unwrap()`, which disappears when `as_chunks::<LANES>()` is used instead.
Optimizations are still very much welcome. They should apply regardless of the underlying SIMD library.
And part of the motivation for switching to `wide` was its better FMA performance compared to `std::simd`. It's good to explore the options.
Also, for benchmarking and profiling, prefer `-C target-cpu=x86-64-v3` so that AVX2 is activated properly until multiversioning is fixed.
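For example (a config fragment, assuming the usual cargo bench workflow):

```shell
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo bench
```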