
Template lane permutation

Open amyspark opened this issue 3 years ago • 7 comments

Hi!

Would it be possible to template lane permutation (i.e. _mm256_permute2f128_*) on AVX and higher architectures? I've found a use for them in some handmade pixel interleavers of mine.

amyspark avatar May 11 '22 22:05 amyspark

Isn't that a request for a shuffle operation, where we currently only provide swizzle? NB: I'm not 100% sure about the terminology here; my understanding is that swizzle operates on one vector and reorders slots based on a mask, while shuffle operates on two vectors and picks and reorders slots based on a mask.
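To make the distinction concrete, here is a minimal scalar sketch of the two operations as described above. It is not xsimd code: `vec4`, `mask4`, `swizzle_model` and `shuffle_model` are hypothetical names, and the index convention for the two-vector case (0..3 pick from the first input, 4..7 from the second) is an assumption borrowed from `__builtin_shufflevector`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-ins for a 4-slot SIMD register and its mask;
// these are illustration-only types, not part of xsimd.
using vec4 = std::array<int, 4>;
using mask4 = std::array<std::size_t, 4>;

// Swizzle: one input. Slot i of the result is v[mask[i]], mask values in [0, 4).
vec4 swizzle_model(const vec4& v, const mask4& mask) {
    vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = v[mask[i]];
    }
    return out;
}

// Shuffle: two inputs. Mask values in [0, 8): 0..3 pick from a, 4..7 pick from b
// (assumed convention, same as __builtin_shufflevector).
vec4 shuffle_model(const vec4& a, const vec4& b, const mask4& mask) {
    vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = mask[i] < 4 ? a[mask[i]] : b[mask[i] - 4];
    }
    return out;
}
```

With these models, `swizzle_model(v, {3, 2, 1, 0})` reverses a single vector, while `shuffle_model(a, b, {0, 4, 1, 5})` interleaves the low halves of two vectors, which a one-input swizzle cannot express.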

serge-sans-paille avatar May 24 '22 18:05 serge-sans-paille

It's a tongue-twister: xsimd's swizzle is the equivalent of Intel's shuffles.

Here I need whole-lane swizzles (128 bits for AVX, 256 bits for AVX512), which in Intel's lingo are permutes.

amyspark avatar May 24 '22 18:05 amyspark

Well, in that case xsimd's swizzle with the correct mask should be equivalent (but less efficient), correct?

serge-sans-paille avatar May 24 '22 19:05 serge-sans-paille

> but less efficient

Hence my request for supporting these correctly, even if only under the kernel::detail namespace, as is currently done with the batch merging primitives.

amyspark avatar May 28 '22 13:05 amyspark

According to the Intel Intrinsics Guide, _mm256_permute4x64_pd has a latency of 3 and a throughput of 1, just like _mm256_permute2f128_pd, so I don't see the point of using specialized instructions for the implementation of swizzle. The value of _mm256_permute2f128_pd is that it accepts two different arguments, but then that's no longer a one-operand swizzle.
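The point about one versus two operands can be checked with a scalar model of the two intrinsics' immediate encodings. This is a sketch only: `vec4d`, `permute4x64_model` and `permute2f128_model` are hypothetical helpers that mimic the documented selection rules on a plain array, so it says nothing about latency or throughput.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for a __m256d (four doubles, two 128-bit lanes).
using vec4d = std::array<double, 4>;

// Model of _mm256_permute4x64_pd(a, imm8):
// result slot i takes a[(imm8 >> 2*i) & 3].
vec4d permute4x64_model(const vec4d& a, std::uint8_t imm8) {
    vec4d out{};
    for (int i = 0; i < 4; ++i) {
        out[i] = a[(imm8 >> (2 * i)) & 3];
    }
    return out;
}

// Model of _mm256_permute2f128_pd(a, b, imm8).
// Each 128-bit half of the result has a 4-bit control field:
// bits 1:0 select a.low (0), a.high (1), b.low (2) or b.high (3);
// bit 3 zeroes that half instead.
vec4d permute2f128_model(const vec4d& a, const vec4d& b, std::uint8_t imm8) {
    vec4d out{};  // value-initialized, so a zeroed half needs no extra work
    for (int half = 0; half < 2; ++half) {
        const unsigned ctrl = (imm8 >> (4 * half)) & 0xF;
        if (ctrl & 0x8) {
            continue;
        }
        const vec4d& src = (ctrl & 0x2) ? b : a;
        const int base = (ctrl & 0x1) ? 2 : 0;
        out[2 * half] = src[base];
        out[2 * half + 1] = src[base + 1];
    }
    return out;
}
```

In this model, swapping the two lanes of a single vector gives the same result either way: `permute2f128_model(a, a, 0x01)` equals `permute4x64_model(a, 0x4E)`. With two distinct inputs, `permute2f128_model(a, b, 0x31)` gathers the high lane of `a` and the high lane of `b`, which has no one-operand equivalent.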

serge-sans-paille avatar May 31 '22 06:05 serge-sans-paille

@amyspark, any thoughts on the above?

serge-sans-paille avatar Jun 25 '22 06:06 serge-sans-paille

I've used them in the RGBA interleaving code here. They're a port of what Clang calculates for __builtin_shufflevector. The permutes in particular are used to assemble the high lanes of two different vectors.

amyspark avatar Jun 25 '22 13:06 amyspark

@amyspark can you double-check whether the newly implemented (#925) xsimd::shuffle meets your needs?

serge-sans-paille avatar May 24 '23 20:05 serge-sans-paille

Closing since, in theory, #925 fixes the issue.

serge-sans-paille avatar Jun 06 '23 19:06 serge-sans-paille