Multiversioned AVX-512 functions lower into SSE intrinsics via `wide`
Profile: https://share.firefox.dev/4otLxuT
You can clearly see that `<wide::f32x16_::f32x16 as core::ops::arith::Mul>::mul` eventually lowers into `core::core_arch::x86::sse::_mm_mul_ps`.
With `RUSTFLAGS=-C target-cpu=native` on Zen 4 it lowers into `core::core_arch::x86::avx512f::_mm512_mul_ps` instead.
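For reference, the dispatch pattern multiversioning is supposed to produce can be sketched with nothing but `std::arch`. This is an illustrative sketch, not code from any of the crates involved; `fma4` and `fma4_dispatch` are made-up names. The key point is that intrinsics only lower to the wide instructions when the enclosing function is compiled with the feature enabled, which is exactly what a crate built against the baseline target misses:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Compiled with FMA enabled, so _mm_fmadd_ps lowers to an actual vfmadd
// instruction. Code in a dependency built without the feature stays at
// the baseline SSE2, no matter what the caller enables.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn fma4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        let vc = _mm_loadu_ps(c.as_ptr());
        let r = _mm_fmadd_ps(va, vb, vc);
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), r);
        out
    }
}

// Runtime dispatch: pick the FMA path when the CPU supports it,
// otherwise fall back to scalar mul_add.
fn fma4_dispatch(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("fma") {
            // SAFETY: FMA support was just verified at runtime.
            return unsafe { fma4(a, b, c) };
        }
    }
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i].mul_add(b[i], c[i]);
    }
    out
}
```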
~~Funnily enough, it's not even a big deal for DiF, which only improves by 10% for size 524288, but DiT gains +50% performance~~ with `-C target-cpu=native` at size 524288. After #41 there's a +10% improvement for DiT and a 10% regression for DiF on Zen 4.
At size 64 there's +33% to be had in DiF and +110% in DiT.
To corroborate this issue, I've just been trying the profile example, and I can see that my AVX2 + FMA Ryzen 5625U falls back to SSE instructions. Setting `-C target-cpu=native` brings back the full expected AVX2 + FMA instructions.
We used to have everything written in terms of `std::simd` but ported it to `wide` to work on stable. We'll need to port to either `fearless_simd` or `macerator`, because `wide` turned out to be incompatible with multiversioning.
My article about the state of SIMD in 2025 was just me writing down the research I did for this issue.
Are you interested in optimisation suggestions in the meantime, or is that wasted effort given the code needs a full port anyway? I was just about to click submit on a new issue regarding the use of `as_chunks` instead of `chunks_exact`. The conversion into `f32x8` in the `wide` SIMD kernels spends a lot of time calling `try_into().unwrap()`, which disappears when `as_chunks::<LANES>()` is used instead.
Optimizations are still very much welcome. They should apply regardless of the underlying SIMD library.
And part of the motivation for switching to `wide` was its better FMA performance compared to `std::simd`. It's good to explore the options.
Also, for benchmarking and profiling, prefer `-C target-cpu=x86-64-v3` so that AVX2 is activated properly until multiversioning is fixed.
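For example (a config fragment, assuming the usual cargo bench workflow):

```shell
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo bench
```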