relaxed-simd
relaxed i8x16.swizzle
- What are the instructions being proposed?
relaxed i8x16.swizzle
- What are the semantics of these instructions?
relaxed i8x16.swizzle(a, s) selects lanes from a using the indices in s. An index i in the range [0, 15] selects the i-th element of a; the result for any out-of-range index (i.e. an index in [16, 255]) is implementation-defined.
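A minimal scalar sketch of the portable part of these semantics, assuming each vector is modelled as an array of 16 bytes. The function names swizzle_indices_in_range and swizzle_defined_lanes are hypothetical and exist only for illustration; they are not part of the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when every index is in [0, 15], i.e. when the relaxed result is
   fully specified and identical on every implementation. */
bool swizzle_indices_in_range(const uint8_t s[16]) {
    for (int i = 0; i < 16; i++) {
        if (s[i] > 15) return false;
    }
    return true;
}

/* Computes only the specified lanes: out[i] = a[s[i]] for in-range indices.
   Lanes with an out-of-range index are left untouched, because their value
   is implementation-defined. Returns the number of lanes written. */
int swizzle_defined_lanes(const uint8_t a[16], const uint8_t s[16],
                          uint8_t out[16]) {
    int written = 0;
    for (int i = 0; i < 16; i++) {
        if (s[i] <= 15) {
            out[i] = a[s[i]];
            written++;
        }
    }
    return written;
}
```

Code that only ever passes indices for which swizzle_indices_in_range holds should observe the same result on every platform.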
- How will these instructions be implemented? Give examples for at least x86-64 and ARM64. Also provide reference implementation in terms of 128-bit Wasm SIMD.
x86/64, pshufb: out-of-range indices will return different results:
- if top bit of index is set, return 0
- else select the (i % 16)-th element
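The two pshufb cases above can be sketched as a scalar model. This is an illustration of the described behavior on byte arrays, not the actual lowering; the name pshufb_model is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the 128-bit pshufb behavior described above:
   - top bit of the index set  -> lane becomes 0
   - otherwise                 -> lane is a[index % 16] */
void pshufb_model(const uint8_t a[16], const uint8_t s[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = (s[i] & 0x80) ? 0 : a[s[i] & 0x0F];
    }
}
```

For example, an index of 0x11 selects a[1], while an index of 0x80 or above yields 0.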
ARM/ARM64, vtbl and tbl: out-of-range indices return 0.
RISC-V V, vrgather.vv a, b: out-of-range indices return 0 (assuming SEW set to 8, LMUL set to 1, VLEN set to 128, so VLMAX = 16).
Simd128, i8x16.swizzle
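For comparison, here is a scalar sketch of the zeroing behavior that the existing i8x16.swizzle specifies and that the text above attributes to ARM tbl and RISC-V vrgather. The name swizzle_zeroing_model is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the zeroing variant: any index outside [0, 15]
   produces 0, otherwise the indexed lane of a is selected. */
void swizzle_zeroing_model(const uint8_t a[16], const uint8_t s[16],
                           uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = (s[i] < 16) ? a[s[i]] : 0;
    }
}
```

Because out-of-range results are implementation-defined, this behavior is one valid implementation of the relaxed instruction.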
- How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
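x86/64 and ARM/ARM64 differ for out-of-range indices whose top bit is clear, i.e. indices in [16, 127]: x86/64 selects the (i % 16)-th element, while ARM/ARM64 returns 0. The result can therefore reveal which of the two architectures is in use.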
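As a rough sketch derived from the x86/64 and ARM/ARM64 behaviors listed earlier, the predicate below marks the index values for which the two architectures can disagree. The name index_may_differ_x86_vs_arm is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Indices in [0, 15] are fully specified. Indices with the top bit set
   (128..255) give 0 on both architectures. Only 16..127 can differ:
   x86 pshufb selects a[idx % 16], ARM tbl returns 0. */
bool index_may_differ_x86_vs_arm(uint8_t idx) {
    return idx >= 16 && idx < 128;
}
```

Even for such an index, the observed lanes coincide when a[idx % 16] happens to be 0, so the predicate is an upper bound on the observable difference.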
- What use cases are there?
Swizzle is quite a common operation, e.g. used in multiple places in meshoptimizer.
On PPC, vpermr (the vec_perm intrinsic) could be used for this. It actually takes two input vectors (plus the index vector) and only the lower 5 bits of each index are used, but if you pass the same vector for both inputs you effectively get the i % 16 behavior.
On z/Arch there is vperm/vec_perm(), which works the same.
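The two-input permute trick can be sketched in scalar form. This model assumes the low 5 bits of each index address the 32-byte concatenation of the two inputs and ignores the endianness and lane-numbering details of the real instructions; perm2_model and swizzle_via_perm2 are hypothetical names.

```c
#include <stdint.h>

/* Scalar model of a two-input byte permute: only the low 5 bits of each
   index are used, selecting from the concatenation of a and b. */
void perm2_model(const uint8_t a[16], const uint8_t b[16],
                 const uint8_t s[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t k = s[i] & 0x1F;
        out[i] = (k < 16) ? a[k] : b[k - 16];
    }
}

/* Passing the same vector for both inputs yields a[index % 16] for every
   possible index value. */
void swizzle_via_perm2(const uint8_t a[16], const uint8_t s[16],
                       uint8_t out[16]) {
    perm2_model(a, a, s, out);
}
```

Note that under this model indices of 128 and above also wrap modulo 16, unlike the x86/64 behavior, which is still permitted because out-of-range results are implementation-defined.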
This instruction is straightforward and was used as an example motivator for the relaxed-simd proposal itself. One question that comes to mind though is the mechanism for enabling? I think this has been discussed before but how would we be expected to enable specific instructions to be their relaxed version while others remain unrelaxed?
We will not enable an existing instruction to be executed in a relaxed manner. The relaxed instruction will be a completely new instruction with different opcode.
Yes, so then you could have a module that has both swizzle and relaxed swizzle instructions? What I am wondering then is if I am writing code in C that is auto-vectorized for example, is there expected to be a way to specify this to the compiler that's targeting Wasm?
Yes, so then you could have a module that has both swizzle and relaxed swizzle instructions?
Yup that is possible.
is there expected to be a way to specify this to the compiler that's targeting Wasm?
Not at the moment. Maybe we can introduce an Emscripten flag to do this, similar to the current -msimd128 flag, that will emit relaxed i8x16.swizzle instead of i8x16.swizzle.
Yes, a flag makes sense. In fact I imagine with the proper dependence analysis the compiler could figure out if it is safe to use the relaxed version of an instruction. Perhaps this should even be a criterion, or part of the motivation, when proposing a relaxed instruction: that with a compiler flag giving permission and proper analysis, a compiler could determine when it is safe to generate the relaxed version.
In fact I imagine with the proper dependence analysis the compiler could figure out if it is safe to use the relaxed version of an instruction.
Good idea, but likely not possible in the most general case. E.g. if the swizzle depends on a mutable global/imported value,
if I am writing code in C that is auto-vectorized for example
I don't expect that the compiler would be able to generate either the normal i8x16.swizzle or a relaxed one from auto-vectorized code.
Note: vtbl is not available on ARM v8-M MVE AFAICT.
RISC-V V has vrgather, which returns 0 for out-of-bounds indices.
For Power, this would likely require vperm with a shift left on the selection vector (vperm uses bits 3:7 of each byte of the selection vector); it will then select modulo 16.