slipstream
Shuffles/swizzles
While many algorithms can get by with only arithmetic operations, as soon as things get more complex, the need to move things across vector lanes or from vector to vector arises. For now, of those operations, slipstream only supports masked blend.
Now, in my experience, compilers are not very good at autovectorizing those operations, so you may want to exclude them from slipstream or at least put a warning on them for this reason. But since these are commonly used SIMD operations, I thought you might want to consider exposing them in the hope that compilers later get good at optimizing them, as you already do for gather and scatter.
Common shuffles that tend to have pretty good hardware support or require nontrivial code, which may thus call for special casing in the API, are...
- Lane rotation (e.g. |x1 x2 ... xN| to |x2 x3 ... xN x1|)
- Lane shift (e.g. a left-shift of 2 lanes turns |x1 x2 ... xN| into |x3 x4 ... xN P P|, where P is a user-chosen padding scalar)
- Interleave and deinterleave (turn |x1 x2 ... xN| and |y1 y2 ... yN| into |x1 y1 ... xN/2 yN/2| and |xN/2+1 yN/2+1 ... xN yN|, plus the reverse operation)
- Transpose (turn a set of N vectors |m11 m12 ... m1N|, |m21 m22 ... m2N|, ..., |mN1 mN2 ... mNN| into another set of N vectors |m11 m21 ... mN1|, |m12 m22 ... mN2|, ..., |m1N m2N ... mNN|)
The experimental std::simd API exposes some of those and may be used as a source of inspiration for naming and method signatures.
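To make the semantics of the shuffles listed above concrete, here is a minimal scalar sketch on plain fixed-size arrays. It does not use slipstream's or std::simd's API: the function names (`lane_rotate_left`, `lane_shift_left`, `interleave`, `deinterleave`, `transpose`) are hypothetical and chosen only for illustration, and whether a compiler turns any of this into actual shuffle instructions is exactly the open question.

```rust
use std::array::from_fn;

/// Lane rotation: |x1 x2 ... xN| -> |x(k+1) ... xN x1 ... xk|.
pub fn lane_rotate_left<T: Copy, const N: usize>(v: [T; N], k: usize) -> [T; N] {
    from_fn(|i| v[(i + k % N) % N])
}

/// Lane shift: drop the first `k` lanes and fill the tail with `pad`.
pub fn lane_shift_left<T: Copy, const N: usize>(v: [T; N], k: usize, pad: T) -> [T; N] {
    from_fn(|i| match i.checked_add(k) {
        Some(j) if j < N => v[j],
        _ => pad,
    })
}

/// Interleave: view x1 y1 x2 y2 ... xN yN as one sequence of 2N elements
/// and return its low and high halves.
pub fn interleave<T: Copy, const N: usize>(x: [T; N], y: [T; N]) -> ([T; N], [T; N]) {
    let pick = |j: usize| if j % 2 == 0 { x[j / 2] } else { y[j / 2] };
    (from_fn(|i| pick(i)), from_fn(|i| pick(N + i)))
}

/// Deinterleave: the reverse of `interleave`.
pub fn deinterleave<T: Copy, const N: usize>(lo: [T; N], hi: [T; N]) -> ([T; N], [T; N]) {
    let at = |j: usize| if j < N { lo[j] } else { hi[j - N] };
    (from_fn(|i| at(2 * i)), from_fn(|i| at(2 * i + 1)))
}

/// Transpose: row r of the output holds column r of the input.
pub fn transpose<T: Copy, const N: usize>(m: [[T; N]; N]) -> [[T; N]; N] {
    from_fn(|r| from_fn(|c| m[c][r]))
}
```

For instance, `lane_shift_left([1, 2, 3, 4], 2, 0)` yields `[3, 4, 0, 0]`, matching the left-shift-by-2 example in the list with P = 0.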
Hello
I must admit, I don't really have the time to put into writing it. But I have no objection to these APIs existing.
Would you want to send a pull request for these (preferably in smaller chunks, let's say each API separately)? It would probably mean finding an example/test algorithm for each and seeing how bad the compiler is at auto-vectorizing it, etc.
Well, I have the same time problem, so it may take a while, but I'll try to contribute some of those.
- Interleave/deinterleave is easy, I'll just give you an example that does complex A * conj(B) multiplication. That's something signal processing people do all the time, and if they're not lucky enough to use a civilized FFT library that emits separate arrays of real and imaginary parts, they need interleave/deinterleave shuffles to do it.
- Transpose and lane shift I need to think about a bit more, because my current use case (stencil) is pretty big for a library usage example, and I need to search a bit to see if I can find "smaller" use cases for those. One problem is that transpose is so expensive that you need to amortize it somehow for SIMD to remain worthwhile, so it's only useful inside of bigger algorithms. But lane shift may have smaller algorithms using it.
- Rotate I've frequently seen in SIMD libraries, so I assumed it's common, but I've never used it myself, so I'll need to search a bit too :) Or just drop that one.
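As a rough idea of what the complex A * conj(B) example from the first bullet could look like, here is a scalar sketch written in a "vector-shaped" style. It is not the example that was eventually contributed and it does not use slipstream's API: the lane count `LANES`, the function name `mul_conj_interleaved`, and the interleaved `[re0, im0, re1, im1, ...]` layout are all assumptions made for illustration. It relies on the identity A * conj(B) = (a_re*b_re + a_im*b_im) + i (a_im*b_re - a_re*b_im).

```rust
/// Hypothetical lane count; a real SIMD type would fix this.
const LANES: usize = 4;

/// Computes out[k] = a[k] * conj(b[k]) for complex numbers stored
/// interleaved as [re0, im0, re1, im1, ...].
/// Assumes all slices have equal length, a multiple of 2 * LANES.
pub fn mul_conj_interleaved(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    assert!(a.len() % (2 * LANES) == 0, "length must be a multiple of 2 * LANES");

    let inputs = a.chunks_exact(2 * LANES).zip(b.chunks_exact(2 * LANES));
    for ((ca, cb), co) in inputs.zip(out.chunks_exact_mut(2 * LANES)) {
        // Deinterleave: split each chunk into real and imaginary lanes.
        let a_re: [f32; LANES] = std::array::from_fn(|i| ca[2 * i]);
        let a_im: [f32; LANES] = std::array::from_fn(|i| ca[2 * i + 1]);
        let b_re: [f32; LANES] = std::array::from_fn(|i| cb[2 * i]);
        let b_im: [f32; LANES] = std::array::from_fn(|i| cb[2 * i + 1]);

        // Lane-wise arithmetic, no cross-lane movement needed here.
        let re: [f32; LANES] = std::array::from_fn(|i| a_re[i] * b_re[i] + a_im[i] * b_im[i]);
        let im: [f32; LANES] = std::array::from_fn(|i| a_im[i] * b_re[i] - a_re[i] * b_im[i]);

        // Interleave: write the results back as [re, im, re, im, ...].
        for i in 0..LANES {
            co[2 * i] = re[i];
            co[2 * i + 1] = im[i];
        }
    }
}
```

The deinterleave and interleave steps are the parts that would map to shuffle operations. With separate real and imaginary arrays they disappear entirely, which is why the "civilized FFT library" case needs no shuffles.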