flexible-vectors
Broadcast lane
Proposed in #28 (originally #27). It differs from the existing splat in that it broadcasts a lane from a vector input rather than a scalar, and it takes an index to select which lane to broadcast:
Gets a single lane from a vector and broadcasts it to the entire vector. `idx` is interpreted modulo the number of lanes in the vector.
```
vec.v8.splat_lane(v: vec.v8, idx: i32) -> vec.v8
vec.v16.splat_lane(v: vec.v16, idx: i32) -> vec.v16
vec.v32.splat_lane(v: vec.v32, idx: i32) -> vec.v32
vec.v64.splat_lane(v: vec.v64, idx: i32) -> vec.v64
vec.v128.splat_lane(v: vec.v128, idx: i32) -> vec.v128
```
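For reference, here is a scalar sketch of the semantics as described above (a hypothetical helper, shown for 32-bit lanes; `nlanes` stands for the runtime lane count of the flexible vector):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vec.v32.splat_lane: the index wraps
   modulo the lane count, then the selected lane fills the whole
   output vector. */
void splat_lane_v32_ref(uint32_t *out, const uint32_t *v,
                        uint32_t idx, size_t nlanes) {
    uint32_t x = v[idx % nlanes];   /* idx interpreted modulo lane count */
    for (size_t i = 0; i < nlanes; i++)
        out[i] = x;
}
```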
On x86, broadcast instructions first appear in AVX (for 32-bit floating-point elements; AVX2 adds integer variants). However, the x86 variants don't take an index and only broadcast the first element of the source. A general-purpose shuffle would be needed to emulate this on SSE, which is not great (definitely slower than a specialized version). Also, taking an index would turn this into a general-purpose shuffle on AVX+ as well.
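To make that concrete, here is a sketch of the 8-bit case on SSSE3, where the variable-index broadcast becomes exactly such a general-purpose byte shuffle (not tied to any particular engine; wider lanes would need byte-level selectors):

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stdint.h>

/* Variable-index lane broadcast for 16 x u8 via PSHUFB: put the
   wrapped index into every byte of a control vector, then the byte
   shuffle picks that lane everywhere. */
static __m128i splat_lane_u8_sse(__m128i v, uint32_t idx) {
    __m128i sel = _mm_set1_epi8((char)(idx & 15)); /* idx mod 16 */
    return _mm_shuffle_epi8(v, sel);
}
```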
It should be noted that `vec.vX.splat_lane` is not a substitute for `vec.S.splat` (which takes a scalar). It is a complement.
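To illustrate that complement-not-substitute point with a fixed-width SSE analogy (a sketch; the helper names are mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* vec.S.splat analogue: broadcast a *scalar* coming from a
   general-purpose register. */
static __m128i splat_scalar_i32(int32_t x) {
    return _mm_set1_epi32(x);
}

/* vec.vX.splat_lane analogue (index fixed at 0 here): broadcast a
   lane that already lives in a vector register, with no round trip
   through a scalar register. */
static __m128i splat_lane0_i32(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));
}
```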
Yes, in general, `vec.vX.splat_lane` would be implemented using 1-input shuffles.
I have 2 use cases for this instruction (see the sketch after this list):
- prefix sums (to propagate the last value to the next iteration)
- matrix multiplication (to save some loads)
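A fixed-width SSE2 sketch of the prefix-sum case, where broadcasting the last lane carries the running total into the next iteration (`n` is assumed to be a multiple of 4 here):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* In-place inclusive prefix sum over 32-bit ints; n must be a
   multiple of 4 in this sketch. */
void prefix_sum_i32(int32_t *a, size_t n) {
    __m128i carry = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        /* log-step in-register prefix sum */
        v = _mm_add_epi32(v, _mm_slli_si128(v, 4));
        v = _mm_add_epi32(v, _mm_slli_si128(v, 8));
        /* fold in the total carried over from the previous iteration */
        v = _mm_add_epi32(v, carry);
        _mm_storeu_si128((__m128i *)(a + i), v);
        /* this is the splat_lane: broadcast the last lane to seed
           the next iteration */
        carry = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```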
This operation is not that easy to implement with Neon and SVE either, because of the variable index. It is basically equivalent to `extract_lane` followed by an indexed `DUP` (index 0), for example. If the index is known at compile time and, in the case of SVE, has a suitable value (fits within the first 16 bytes), then it can be reduced to just the `DUP`.
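When the index is a compile-time constant, the Neon side is indeed a single indexed DUP; a sketch using AArch64 intrinsics (the lane argument must be a constant, hence the fixed 2 here):

```c
#include <arm_neon.h>

/* Compile-time-known index: one indexed DUP, no extract needed. */
static int32x4_t splat_lane2_i32(int32x4_t v) {
    return vdupq_laneq_s32(v, 2); /* lane index is a constant */
}
```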
Actually, x86 SIMD has even more limitations, but overall, I don't think it's that bad.
First, I assume that WASM engines will actually do proper constant folding and thus detect when the index is "compile-time" known. This is not quite pattern matching, though (and is actually simpler and more robust, I assume).
Also, for 8-bit types it is not that hard to implement on all archs: DUP the index into a vector, and use this vector in a TBL. For larger types, a bit more index manipulation is required because TBL (and its equivalents) have byte granularity.
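A Neon sketch of that TBL approach for the 8-bit case (for wider lanes the selector would need byte offsets such as 4*idx + {0,1,2,3}):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Variable-index broadcast for 16 x u8: DUP the wrapped index into
   every byte of a selector vector, then TBL picks that lane
   everywhere. */
static uint8x16_t splat_lane_u8_neon(uint8x16_t v, uint32_t idx) {
    uint8x16_t sel = vdupq_n_u8((uint8_t)(idx & 15)); /* idx mod 16 */
    return vqtbl1q_u8(v, sel);
}
```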
I am not sure if in that case I am comfortable with the idea of using `TBL` for SVE (or a transfer from a general-purpose register to a vector register), so what I have in mind is:
```
// w0 holds the lane index, z0 the input vector
cntb    x1                  // x1 = vector length in bytes
udiv    w2, w0, w1          // w2 = idx / VL
msub    w1, w1, w2, w0      // w1 = idx - VL * (idx / VL) = idx mod VL
whilelo p0.b, wzr, w1       // predicate covering lanes [0, idx mod VL)
lasta   b1, p0, z0.b        // element just past the last active lane
dup     z1.b, z1.b[0]       // broadcast it to the whole vector
```
The first 3 instructions would be the same for the `TBL` approach as well (they take care of getting the index into the proper range).
Oh, I missed that `lasta` could put the result into a vector register (the first lane, I assume). Then yes, it is probably a good way to do it.
However, this trick is only doable in SVE, and certainly not on x86. So all in all, the "TBL"-like implementation will still be good for those archs.
Also, as `cntb` is constant during execution, it could be hardcoded during translation, and for a power-of-two vector length those 3 instructions become a single one (the modulo reduces to an AND).
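In C terms, this is the usual strength reduction of the modulo (a sketch; `vl_lanes` stands for the hardcoded, translation-time lane count):

```c
#include <stdint.h>

/* If the vector length is a power of two, the udiv/msub pair that
   computes idx mod VL collapses to a single AND. */
static uint32_t wrap_index(uint32_t idx, uint32_t vl_lanes) {
    return idx & (vl_lanes - 1); /* valid only for power-of-two vl_lanes */
}
```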
I don't assume JIT compilation in my analysis, and I use vector-length-agnostic code generation; in fact, AOT compilation of WebAssembly doesn't sound like something completely out of the ordinary (still not common, though). Yes, that's arguably quite conservative, but it's usually straightforward to simplify once the assumptions are known.