
`wide::u8x32` is only slower than `wide::u8x16` on i9-13900K/i9-14900K

wtdcode opened this issue 8 months ago · 10 comments

Hello! Thanks for your brilliant work. We have a micro-benchmark for our task: https://github.com/wtdcode/libafl_simd_bench. However, we notice a performance regression of wide::u8x32 compared to wide::u8x16. This happens specifically on the i9-13900K and i9-14900K, even when binding to P cores, while performance is mostly the same on AMD parts (less slowdown) and on aarch64. Is there anything wrong with wide, or is it just a matter of CPU features?

wtdcode commented Apr 07 '25 03:04

Depending on CPU features, wide treats u8x32 as either one m256i value (when avx2 is available at build time), or as two u8x16 values. Your target-cpu=native builds on x86_64 would I assume have the avx2 feature enabled, and so go through actual AVX2 code paths.

However, I don't really know enough about SIMD benchmarking to say more than that.
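A rough sketch of that build-time layout split (not wide's actual source; type and field names here are hypothetical):

```rust
// Sketch: the backing representation is chosen when the crate is compiled.
// With avx2 enabled at build time, use a single 256-bit register;
// otherwise fall back to a pair of u8x16 halves.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub struct U8x32 {
    pub v: core::arch::x86_64::__m256i,
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
pub struct U8x32 {
    pub halves: [wide::u8x16; 2],
}
```

Which variant you get is fixed at compile time by the build flags (e.g. -C target-cpu=native on an AVX2-capable machine), not detected at runtime.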

Lokathor commented Apr 07 '25 04:04

Depending on CPU features, wide treats u8x32 as either one m256i value (when avx2 is available at build time), or as two u8x16 values. Your target-cpu=native builds on x86_64 would I assume have the avx2 feature enabled, and so go through actual AVX2 code paths.

How about aarch64? Is it possible that the AVX2 code paths are inherently slower, and that aarch64 takes the two-u8x16 code path?

wtdcode commented Apr 07 '25 04:04

I'll close this, as I found the bottleneck is not wide::u8x32 but another loop that, according to perf, isn't vectorized. I'll create another issue if I find anything else. @Lokathor Thanks for the instant feedback!

wtdcode commented Apr 07 '25 04:04

Just to answer the question: aarch64 has no special handling at this time. Explicit aarch64 intrinsic use wasn't stable when I wrote most of the lib, and if it is stable now no one has cared enough to PR in said support.

Lokathor commented Apr 07 '25 07:04

Just to answer the question: aarch64 has no special handling at this time. Explicit aarch64 intrinsic use wasn't stable when I wrote most of the lib, and if it is stable now no one has cared enough to PR in said support.

Now our issue is a bit different: we would like _mm_mask_max_epu8 but can't find it anywhere in wide or safe_arch =/.

wtdcode commented Apr 07 '25 07:04

https://doc.rust-lang.org/core/arch/x86/fn._mm_mask_max_epu8.html

This is a nightly-only experimental API. (stdarch_x86_avx512 #111137)

You can PR it into safe_arch behind a feature flag, then make wide use it with a feature flag, but I don't have an AVX-512 CPU, and I doubt the CI runners have such a CPU.

Lokathor commented Apr 07 '25 08:04

https://doc.rust-lang.org/core/arch/x86/fn._mm_mask_max_epu8.html

This is a nightly-only experimental API. (stdarch_x86_avx512 #111137)

You can PR it into safe_arch behind a feature flag, then make wide use it with a feature flag, but I don't have an AVX-512 CPU, and I doubt the CI runners have such a CPU.

Oh sorry, I meant that this family of intrinsics is missing: I can't do this for u8x16/u8x32, and I don't really need AVX-512. Or am I missing something in wide? The general logic I'm trying to vectorize is:

https://github.com/wtdcode/libafl_simd_bench/blob/19200387552bbb02b6d16db033312d7b14b7106a/src/main.rs#L126-L130

wtdcode commented Apr 07 '25 08:04

Specifically, LLVM can vectorize the loop when it operates on 16 bytes (m128), emitting something like:

pmaxub %xmm0, %xmm2
pcmpeqb %xmm1, %xmm2
pmovmskb %xmm2, %eax

I know we have move_mask for pmovmskb, but I can't find a pmaxub equivalent.
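For reference, here is that three-instruction pattern spelled out with raw core::arch intrinsics; unchanged_lanes is a hypothetical helper written for illustration, not code from the benchmark:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_cmpeq_epi8, _mm_loadu_si128, _mm_max_epu8, _mm_movemask_epi8};

/// Hypothetical helper: bit i of the result is set when map[i] <= hist[i],
/// i.e. when max(hist[i], map[i]) == hist[i].
#[cfg(target_arch = "x86_64")]
fn unchanged_lanes(hist: &[u8; 16], map: &[u8; 16]) -> i32 {
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned loads
    // have no alignment requirement.
    unsafe {
        let h = _mm_loadu_si128(hist.as_ptr().cast());
        let m = _mm_loadu_si128(map.as_ptr().cast());
        let max = _mm_max_epu8(h, m);    // pmaxub
        let eq = _mm_cmpeq_epi8(max, h); // pcmpeqb
        _mm_movemask_epi8(eq)            // pmovmskb
    }
}
```

Comparing the returned mask against 0xFFFF then tells you whether any lane of map exceeded hist.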

Also, with manual loop unrolling, LLVM will vectorize it and reach similar performance.

wtdcode commented Apr 07 '25 08:04

I don't have an answer at the moment, but this is enough of an open issue that I think we can at least reopen. Maybe someone else will know.

Lokathor commented Apr 07 '25 08:04

Isn't that just u8x16::max?

mcroomp commented May 08 '25 16:05
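For completeness, mcroomp's suggestion as a minimal sketch with wide. u8x16::max is confirmed just above and move_mask is mentioned earlier in the thread; cmp_eq and the exact signatures are assumptions and may differ between wide versions:

```rust
use wide::u8x16;

// Sketch: the same max / compare-equal / movemask idea via wide's safe API.
fn unchanged_lanes(hist: u8x16, map: u8x16) -> i32 {
    let max = hist.max(map);   // pmaxub
    let eq = max.cmp_eq(hist); // pcmpeqb (assumed method name)
    eq.move_mask()             // pmovmskb (mentioned earlier in the thread)
}
```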