lazymio

Results 721 comments of lazymio

> Depending on CPU features, `wide` treats `u8x32` as either one `m256i` value (when `avx2` is available at build time), or as two `u8x16` values. Your `target-cpu=native` builds on x86_64...

I'll close this: using `perf`, I found that the bottleneck is not `wide::u8x32` but another loop that is not vectorized. I'll create another issue if I find anything else. @Lokathor Thanks...

> Just to answer the question: aarch64 has no special handling at this time. Explicit aarch64 intrinsic use wasn't stable when I wrote most of the lib, and if it...

> https://doc.rust-lang.org/core/arch/x86/fn._mm_mask_max_epu8.html
>
> > This is a nightly-only experimental API. (stdarch_x86_avx512 [#111137](https://github.com/rust-lang/rust/issues/111137))
>
> You can PR it into `safe_arch` behind a feature flag, then make `wide` use it...

Specifically, the loop could be vectorized by LLVM when there are 16 bytes (m128) with something like:

```
pmaxub %xmm0, %xmm2
pcmpeqb %xmm1, %xmm2
pmovmskb %xmm2, %eax
```

I know...
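For illustration, the same three-instruction pattern can be written by hand with the stable SSE2 intrinsics in `core::arch::x86_64`. This is only a sketch: the helper names `le_mask_u8x16` and `le_mask_scalar` are hypothetical, and the choice of operands (an unsigned `a[i] <= b[i]` test) is an assumption, since the comment does not say what the registers hold. It relies on the usual identity that `max(a, b) == b` exactly when `a <= b` for unsigned bytes.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{
    __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_max_epu8, _mm_movemask_epi8,
};

/// Hypothetical helper: bit `i` of the result is set when `a[i] <= b[i]`
/// (unsigned comparison), computed with pmaxub / pcmpeqb / pmovmskb.
#[cfg(target_arch = "x86_64")]
pub fn le_mask_u8x16(a: &[u8; 16], b: &[u8; 16]) -> u32 {
    // SAFETY: SSE2 is part of the x86_64 baseline, and both pointers refer to
    // valid 16-byte arrays read with unaligned loads.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let max = _mm_max_epu8(va, vb); // pmaxub: per-byte unsigned max
        let eq = _mm_cmpeq_epi8(max, vb); // pcmpeqb: 0xFF where max == b
        _mm_movemask_epi8(eq) as u32 // pmovmskb: top bit of each byte
    }
}

/// Fallback for other targets (assumption: plain scalar code is acceptable).
#[cfg(not(target_arch = "x86_64"))]
pub fn le_mask_u8x16(a: &[u8; 16], b: &[u8; 16]) -> u32 {
    le_mask_scalar(a, b)
}

/// Scalar reference with the same semantics, useful for cross-checking.
pub fn le_mask_scalar(a: &[u8; 16], b: &[u8; 16]) -> u32 {
    let mut mask = 0u32;
    for i in 0..16 {
        if a[i] <= b[i] {
            mask |= 1 << i;
        }
    }
    mask
}
```

Calling `le_mask_u8x16(&a, &b)` and comparing the result with `le_mask_scalar(&a, &b)` is a quick way to check that the vector path matches the scalar loop, including for byte values above 127 where a signed comparison would give a different answer.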

#2021 has conflicts.

#1918 has conflicts too, but it seems easy to merge.

#1903 is not resolved yet and there is no CI for loongarch. I have a side project for local testing: https://github.com/wtdcode/DebianOnQEMU?tab=readme-ov-file#loongarch64

TODO: Move all

```
DEF_HELPER_4(uc_tracecode, void, i32, i32, ptr, i64)
DEF_HELPER_6(uc_traceopcode, void, ptr, i64, i64, i32, ptr, i64)
```

to tcg-runtime.h.

TODO: We should have 2.2.0rc1