Amyspark comments

Results: 107 comments by Amyspark

Add rcp and rsqrt

I fixed this in #710; @serge-sans-paille did so in #679.

dispatch assumes that target architectures are supported

> This means that in general it is not safe to run a program compiled with -mavx2 (for example) on a CPU that doesn't support AVX2, even if any code...
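As a rough illustration of the concern quoted above, here is a minimal sketch of checking CPU support at run time before choosing a kernel. The names `cpu_has_avx2`, `sum_generic`, `sum_avx2` and `select_sum` are hypothetical and are not xsimd API. `__builtin_cpu_supports` is a GCC/Clang builtin for x86 targets; on other compilers or architectures the sketch simply falls back to the generic path.

```cpp
#include <cstddef>

// Returns true only when the running CPU reports AVX2 support.
// Assumption: GCC or Clang on x86; anywhere else we conservatively say "no".
inline bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Portable fallback kernel (hypothetical example workload).
inline float sum_generic(const float* p, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += p[i];
    return acc;
}

// Placeholder for a kernel that, in a real program, would live in its own
// translation unit compiled with -mavx2. Here it just forwards to the
// generic version so the sketch stays runnable everywhere.
inline float sum_avx2(const float* p, std::size_t n) {
    return sum_generic(p, n);
}

using sum_fn = float (*)(const float*, std::size_t);

// Pick the kernel once, based on what the CPU actually supports.
inline sum_fn select_sum() {
    return cpu_has_avx2() ? &sum_avx2 : &sum_generic;
}
```

The point of the design is that only the selected kernel ever executes AVX2 code; everything reachable before the check has to be compiled for the baseline target.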

dispatch assumes that target architectures are supported

> https://github.com/xtensor-stack/xsimd/pull/675 proposes an approach to fix that issue, I'd happily take feedbacks.

I'd say to templatize the architecture parameter or make it part of the function signature, the current...
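To make the two options in that reply concrete, here is a small sketch of what "architecture as a template parameter" and "architecture in the function signature" could look like. All tag and function names (`generic_arch`, `sse2_arch`, `avx2_arch`, `kernel_name`, `run`) are hypothetical, chosen for illustration; this is not the design proposed in the linked pull request.

```cpp
#include <string>

// Hypothetical architecture tags. Inheritance lets a more capable
// architecture fall back to the overload written for a base one.
struct generic_arch {};
struct sse2_arch : generic_arch {};
struct avx2_arch : sse2_arch {};

// Option 1: the architecture is part of the function signature.
// Overload resolution picks the most specific kernel available.
inline std::string kernel_name(generic_arch) { return "generic"; }
inline std::string kernel_name(avx2_arch) { return "avx2"; }

// Option 2: the architecture is a template parameter, forwarded
// to the tagged overloads above.
template <class Arch>
std::string run() {
    return kernel_name(Arch{});
}
```

With this layout, `run<avx2_arch>()` selects the AVX2 overload, while `run<sse2_arch>()` has no dedicated overload and falls back to the generic one through the base-class conversion.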

Support gather for different sizes of types on data and indices

This is what I did to hand-optimize two cases we use at Krita: https://github.com/xtensor-stack/xsimd/blob/c7567bbedebcfbf3ba95304ff1a6722b32a0d63f/include/xsimd/arch/xsimd_avx2.hpp#L350-L369

Instead of using separate batch types, I would suggest using SFINAE on the size of the...
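The comment is truncated, so the following is only one possible reading, guided by the thread title (different sizes for data and index types). It sketches how SFINAE can select between two `gather` overloads depending on whether the element and index types have the same size. The `gather` function and the `gather_path` enum are hypothetical, and both overloads use a plain scalar loop where a SIMD library would use hardware gathers or a widening/narrowing step.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Reports which overload was chosen (for illustration only).
enum class gather_path { same_size, mixed_size };

// Chosen when data and index elements have the same size, the case a
// SIMD backend could map directly onto a hardware gather.
template <class T, class I,
          typename std::enable_if<sizeof(T) == sizeof(I), int>::type = 0>
gather_path gather(const T* src, const I* idx, T* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[idx[i]];
    return gather_path::same_size;
}

// Chosen when the sizes differ; a real backend would need to convert
// the indices (or split the work) before gathering.
template <class T, class I,
          typename std::enable_if<sizeof(T) != sizeof(I), int>::type = 0>
gather_path gather(const T* src, const I* idx, T* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[idx[i]];
    return gather_path::mixed_size;
}
```

For example, gathering `float` data with `std::int32_t` indices takes the same-size overload, while `std::int64_t` indices take the mixed-size one, without the caller naming a different batch type.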

Feature/all inline

The performance bug has been reported to MSVC [here](https://developercommunity.visualstudio.com/t/2x-performance-loss-when-using-__forcein/1592199).

Feature/all inline

@serge-sans-paille upon further review, it seems that, instead of e.g. shifting a register right then using the result, MSVC spills the register on the stack, loads it, shifts, pushes and...

Implement an inline macro to have maximum performance on MSVC

I'll test this branch tonight with my benchmark.

Implement an inline macro to have maximum performance on MSVC

@serge-sans-paille, #645 has no effect on my benchmark; MSVC still doesn't inline xsimd's methods, resulting in a 50% perf hit compared to `__forceinline`. (Using `/Ob3 /O2 /Gv /Oi`)

Implement an inline macro to have maximum performance on MSVC

> @amyspark #645 updated with always inline, can you check if that fixes your issue?

That does the trick! But the `friend` functions in `xsimd::batch`, e.g. https://github.com/xtensor-stack/xsimd/blob/54aa8e72bc7cda47907879f5ad2a9c11b4c127e7/include/xsimd/types/xsimd_batch.hpp#L163-L171 still need to...
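For readers unfamiliar with the technique under discussion, a force-inline macro usually looks something like the sketch below. `MY_FORCE_INLINE`, `add_one` and `value` are hypothetical names and not necessarily what #645 uses; `__forceinline` is the MSVC keyword and `__attribute__((always_inline))` is the GCC/Clang attribute. The `friend` operator is included only to show where such a macro can be placed on a friend function defined inside a class.

```cpp
// Pick the strongest inlining hint the compiler offers.
#if defined(_MSC_VER)
#define MY_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define MY_FORCE_INLINE inline __attribute__((always_inline))
#else
#define MY_FORCE_INLINE inline
#endif

// A free function marked with the macro.
MY_FORCE_INLINE int add_one(int x) {
    return x + 1;
}

// A small wrapper type whose friend operator carries the same macro.
struct value {
    int v;

    friend MY_FORCE_INLINE value operator+(value a, value b) {
        return value{a.v + b.v};
    }
};
```

Thin wrapper functions like these are the ones where a missed inline costs the most, since the call overhead can exceed the work of the single instruction they wrap.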

Template lane permutation

It's a tongue-twister: xsimd's `swizzle` is the equivalent of Intel's `shuffle`s. Here I need *whole-lane* (128 bits for AVX, 256 bits for AVX512) `swizzle`s, which in Intel's lingo are `permute`s.
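To illustrate what a whole-lane permutation does, here is a scalar model with compile-time lane indices. `permute_lanes` is a hypothetical helper, not xsimd API: it treats an array as consecutive lanes of `LaneSize` elements and copies whole lanes according to the template indices, roughly what an instruction such as Intel's `_mm256_permute2f128_ps` does for the two 128-bit halves of a 256-bit register.

```cpp
#include <array>
#include <cstddef>

// Scalar model of a whole-lane permutation.
// LaneSize: number of elements per lane.
// LaneIdx:  for each destination lane, the index of the source lane.
template <std::size_t LaneSize, std::size_t... LaneIdx, class T, std::size_t N>
std::array<T, N> permute_lanes(const std::array<T, N>& in) {
    static_assert(sizeof...(LaneIdx) * LaneSize == N,
                  "one source lane index is required per destination lane");
    constexpr std::size_t src[] = {LaneIdx...};
    std::array<T, N> out{};
    for (std::size_t lane = 0; lane < sizeof...(LaneIdx); ++lane) {
        // Elements keep their position inside the lane; only lanes move.
        for (std::size_t e = 0; e < LaneSize; ++e) {
            out[lane * LaneSize + e] = in[src[lane] * LaneSize + e];
        }
    }
    return out;
}
```

For eight elements in two lanes of four, `permute_lanes<4, 1, 0>(x)` swaps the two halves, whereas an element-level swizzle would rearrange values inside each lane.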