xsimd
Template lane permutation
Hi!
Would it be possible to template lane permutation (i.e. _mm256_permute2f128_*) on AVX and higher architectures? I've found a use for them in some handmade pixel interleavers of mine.
Isn't that a request for a shuffle operation, where we currently only provide swizzle?
NB: I'm not 100% sure about the terminology here. My understanding is that swizzle operates on one vector and reorders its slots based on a mask, while shuffle operates on two vectors and picks and reorders slots from both based on a mask.
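To make the terminology concrete, here is a minimal scalar sketch of the two operations as described in this comment. The names `swizzle_model` and `shuffle_model` are hypothetical and are not xsimd's API, and the convention that indices in `[N, 2N)` select from the second vector is an assumption borrowed from `__builtin_shufflevector`.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar model, NOT xsimd's API.
// swizzle: one input vector, out[i] = in[mask[i]].
template <class T, std::size_t N>
std::array<T, N> swizzle_model(const std::array<T, N>& in,
                               const std::array<std::size_t, N>& mask)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = in[mask[i]];
    return out;
}

// shuffle: two input vectors. Assumed convention: an index in [0, N)
// picks from a, an index in [N, 2N) picks from b.
template <class T, std::size_t N>
std::array<T, N> shuffle_model(const std::array<T, N>& a,
                               const std::array<T, N>& b,
                               const std::array<std::size_t, N>& mask)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = mask[i] < N ? a[mask[i]] : b[mask[i] - N];
    return out;
}
```

Under this model, a swizzle is just a shuffle whose mask never reaches past the first operand.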
It's a tongue-twister: xsimd's swizzle is the equivalent of Intel's shuffles.
Here I need whole-lane (128 bits for AVX, 256 bits for AVX512) swizzles, which Intel's lingo calls permutes.
Well, in that case xsimd's swizzle with the correct mask should be equivalent (but less efficient), correct?
> but less efficient
Hence my request for supporting these properly, even if only under the kernel::detail namespace, as is currently done with the batch merging primitives.
According to the Intel Intrinsics Guide, _mm256_permute4x64_pd has a latency of 3 and a throughput of 1, just like _mm256_permute2f128_pd, so I don't see the benefit of using specialized instructions to implement swizzle. The benefit of _mm256_permute2f128_pd is that it accepts two different arguments, but then it's no longer a one-operand swizzle.
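For reference, the lane selection that _mm256_permute2f128_pd performs can be sketched in scalar code. This is an illustrative model of the instruction's documented control byte, not xsimd code. The names `vec4d` and `permute2f128_model` are hypothetical.

```cpp
#include <array>

using vec4d = std::array<double, 4>; // hypothetical stand-in for __m256d

// Scalar model of _mm256_permute2f128_pd(a, b, imm8).
// Each 4-bit control field picks one 128-bit lane (two doubles):
//   bit 3 set      -> zero the destination lane
//   bit 1          -> source vector (0 = a, 1 = b)
//   bit 0          -> source lane   (0 = low, 1 = high)
// Bits [3:0] drive the low output lane, bits [7:4] the high one.
inline vec4d permute2f128_model(const vec4d& a, const vec4d& b, unsigned imm8)
{
    auto select = [&](unsigned ctrl, double* dst) {
        if (ctrl & 0x8) {
            dst[0] = dst[1] = 0.0;
            return;
        }
        const vec4d& src = (ctrl & 0x2) ? b : a;
        const unsigned off = (ctrl & 0x1) ? 2 : 0;
        dst[0] = src[off];
        dst[1] = src[off + 1];
    };
    vec4d out{};
    select(imm8 & 0xF, out.data());
    select((imm8 >> 4) & 0xF, out.data() + 2);
    return out;
}
```

With the same vector passed twice, `permute2f128_model(a, a, 0x01)` swaps the two lanes, which a one-operand swizzle with mask {2, 3, 0, 1} also expresses. With two distinct vectors, `0x31` gathers the high lane of `a` and the high lane of `b`, which is the two-operand case a swizzle cannot cover.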
@amyspark, any thoughts on the above?
I've used them in the RGBA interleaving code here. They're a port of what Clang calculates for __builtin_shufflevector. The permutes in particular are used to assemble the high lanes of two different vectors.
@amyspark can you double-check whether the newly implemented (#925) xsimd::shuffle meets your needs?
Closing since, in theory, #925 fixes the issue.