
`wide::u8x32` is only slower than `wide::u8x16` on i9-13900K/i9-14900K

wtdcode opened this issue 8 months ago · 10 comments

Hello! Thanks for your brilliant work. We have a micro-benchmark for our task: https://github.com/wtdcode/libafl_simd_bench. However, we notice a performance regression of wide::u8x32 compared to wide::u8x16. This happens specifically on the i9-13900K and i9-14900K, even when binding to P cores, while performance is mostly the same on AMD parts (less slowdown) and on aarch64. Is there anything wrong with wide, or is it just a matter of CPU features?

wtdcode commented Apr 07 '25 03:04

Depending on CPU features, wide treats u8x32 as either one m256i value (when avx2 is available at build time), or as two u8x16 values. Your target-cpu=native builds on x86_64 would I assume have the avx2 feature enabled, and so go through actual AVX2 code paths.

However, I don't really know enough about SIMD benchmarking to say more than that.
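A rough sketch of that build-time layout split (not wide's actual source; type and field names here are hypothetical):

```rust
// Sketch: the backing representation is chosen when the crate is compiled.
// With avx2 enabled at build time, use a single 256-bit register;
// otherwise fall back to a pair of u8x16 halves.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub struct U8x32 {
    pub v: core::arch::x86_64::__m256i,
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
pub struct U8x32 {
    pub halves: [wide::u8x16; 2],
}
```

Which variant you get is fixed at compile time by the build flags (e.g. -C target-cpu=native on an AVX2-capable machine), not detected at runtime.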

Lokathor commented Apr 07 '25 04:04

Depending on CPU features, wide treats u8x32 as either one m256i value (when avx2 is available at build time), or as two u8x16 values. Your target-cpu=native builds on x86_64 would I assume have the avx2 feature enabled, and so go through actual AVX2 code paths.

How about aarch64? Is it possible that the AVX2 code paths are inherently slower, and that aarch64 takes the two-u8x16 code path?

wtdcode commented Apr 07 '25 04:04

I'll close this, as I found the bottleneck is not wide::u8x32 but another loop that, according to perf, isn't vectorized. I'll create another issue if I find anything else. @Lokathor Thanks for the instant feedback!

wtdcode commented Apr 07 '25 04:04

Just to answer the question: aarch64 has no special handling at this time. Explicit aarch64 intrinsic use wasn't stable when I wrote most of the lib, and if it is stable now no one has cared enough to PR in said support.

Lokathor commented Apr 07 '25 07:04

Just to answer the question: aarch64 has no special handling at this time. Explicit aarch64 intrinsic use wasn't stable when I wrote most of the lib, and if it is stable now no one has cared enough to PR in said support.

Now our issue is a bit different: we would like _mm_mask_max_epu8 but can't find it anywhere in wide or safe_arch =/.

wtdcode commented Apr 07 '25 07:04

https://doc.rust-lang.org/core/arch/x86/fn._mm_mask_max_epu8.html

This is a nightly-only experimental API. (stdarch_x86_avx512 #111137)

You can PR it into safe_arch behind a feature flag, then make wide use it with a feature flag, but I don't have an AVX-512 CPU, and I doubt the CI runners have such a CPU.

Lokathor commented Apr 07 '25 08:04

https://doc.rust-lang.org/core/arch/x86/fn._mm_mask_max_epu8.html

This is a nightly-only experimental API. (stdarch_x86_avx512 #111137)

You can PR it into safe_arch behind a feature flag, then make wide use it with a feature flag, but I don't have an AVX-512 CPU, and I doubt the CI runners have such a CPU.

Oh sorry, I meant that this family of intrinsics is missing: I can't do this for u8x16/u8x32, and I don't really need AVX-512. Or am I missing something in wide? The general logic I'm trying to vectorize is:

https://github.com/wtdcode/libafl_simd_bench/blob/19200387552bbb02b6d16db033312d7b14b7106a/src/main.rs#L126-L130

wtdcode commented Apr 07 '25 08:04

Specifically, LLVM can vectorize the loop when it operates on 16 bytes (m128), emitting something like:

pmaxub %xmm0, %xmm2
pcmpeqb %xmm1, %xmm2
pmovmskb %xmm2, %eax

I know we have move_mask for pmovmskb, but I can't find a pmaxub equivalent.
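For reference, here is that three-instruction pattern spelled out with raw core::arch intrinsics; unchanged_lanes is a hypothetical helper written for illustration, not code from the benchmark:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_cmpeq_epi8, _mm_loadu_si128, _mm_max_epu8, _mm_movemask_epi8};

/// Hypothetical helper: bit i of the result is set when map[i] <= hist[i],
/// i.e. when max(hist[i], map[i]) == hist[i].
#[cfg(target_arch = "x86_64")]
fn unchanged_lanes(hist: &[u8; 16], map: &[u8; 16]) -> i32 {
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned loads
    // have no alignment requirement.
    unsafe {
        let h = _mm_loadu_si128(hist.as_ptr().cast());
        let m = _mm_loadu_si128(map.as_ptr().cast());
        let max = _mm_max_epu8(h, m);    // pmaxub
        let eq = _mm_cmpeq_epi8(max, h); // pcmpeqb
        _mm_movemask_epi8(eq)            // pmovmskb
    }
}
```

Comparing the returned mask against 0xFFFF then tells you whether any lane of map exceeded hist.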

Also, with manual loop unrolling, LLVM will vectorize it and reach similar performance.

wtdcode commented Apr 07 '25 08:04

I don't have an answer at the moment, but this is enough of an open issue that I think we can at least reopen. Maybe someone else will know.

Lokathor commented Apr 07 '25 08:04

Isn't that just u8x16::max?

mcroomp commented May 08 '25 16:05
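For completeness, mcroomp's suggestion as a minimal sketch with wide. u8x16::max is confirmed just above and move_mask is mentioned earlier in the thread; cmp_eq and the exact signatures are assumptions and may differ between wide versions:

```rust
use wide::u8x16;

// Sketch: the same max / compare-equal / movemask idea via wide's safe API.
fn unchanged_lanes(hist: u8x16, map: u8x16) -> i32 {
    let max = hist.max(map);   // pmaxub
    let eq = max.cmp_eq(hist); // pcmpeqb (assumed method name)
    eq.move_mask()             // pmovmskb (mentioned earlier in the thread)
}
```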