relaxed-simd relaxed i8x16.swizzle

What are the instructions being proposed?

relaxed i8x16.swizzle

What are the semantics of these instructions?

relaxed i8x16.swizzle(a, s) selects lanes from a using the indices in s. An index i in the range [0, 15] selects the i-th element of a; the result for any out-of-range index (i.e. an index in [16, 255]) is implementation-defined.

How will these instructions be implemented? Give examples for at least x86-64 and ARM64. Also provide reference implementation in terms of 128-bit Wasm SIMD.

x86/64: pshufb. Out-of-range indices return different results depending on the top bit of the index:

if top bit of index is set, return 0
else select the i % 16-th element
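
The rule above can be sketched as a scalar model of a single output lane. This is only an illustration of the behavior described here, not the instruction itself; the function name pshufb_lane is made up for this example.

```c
#include <stdint.h>

/* Scalar model of one pshufb output lane, following the rule above.
   a:   the 16 source bytes
   idx: the index byte for this lane
   The name pshufb_lane is hypothetical and exists only for this sketch. */
static uint8_t pshufb_lane(const uint8_t a[16], uint8_t idx) {
    if (idx & 0x80) {
        return 0;           /* top bit set: the lane is zeroed */
    }
    return a[idx % 16];     /* otherwise the index wraps modulo 16 */
}
```

Under this model an index of 17 selects element 1, while an index of 128 or above yields 0.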

ARM/ARM64, vtbl and tbl, out of range indices return 0.

RISC-V V: vrgather.vv, out-of-range indices return 0 (assuming SEW set to 8, LMUL set to 1, VLEN set to 128, so VLMAX = 16).

Simd128, i8x16.swizzle
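
For reference, the existing i8x16.swizzle returns 0 for every index of 16 or greater, which is one of the results the relaxed version permits. A scalar sketch of that zeroing behavior over all 16 lanes might look like the following; the function name swizzle_zero_oob is hypothetical, and the sketch assumes the output buffer does not alias the input.

```c
#include <stdint.h>

/* Scalar model of a 16-lane swizzle that zeroes out-of-range lanes.
   out, a and s each hold 16 bytes; out is assumed not to alias a or s.
   The name swizzle_zero_oob is hypothetical and exists only for this sketch. */
static void swizzle_zero_oob(uint8_t out[16], const uint8_t a[16],
                             const uint8_t s[16]) {
    for (int i = 0; i < 16; i++) {
        /* In-range indices select a lane of a; everything else becomes 0. */
        out[i] = (s[i] < 16) ? a[s[i]] : 0;
    }
}
```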

How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

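Out-of-range indices produce different results on x86/64 and ARM/ARM64 (see the implementations above), so the result of the instruction can reveal the underlying architecture.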
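
To make the difference concrete, here is a rough sketch that compares scalar models of the two behaviors described earlier and counts the index values on which they disagree. All function names are invented for this illustration, and the count assumes the source lanes are nonzero (a zero lane would hide the difference).

```c
#include <stdint.h>

/* x86-style lane: zero if the top bit is set, otherwise wrap modulo 16. */
static uint8_t lane_x86_style(const uint8_t a[16], uint8_t idx) {
    return (idx & 0x80) ? 0 : a[idx % 16];
}

/* ARM-style lane: zero for every out-of-range index. */
static uint8_t lane_arm_style(const uint8_t a[16], uint8_t idx) {
    return (idx < 16) ? a[idx] : 0;
}

/* Counts the index values in [lo, hi] on which the two models disagree.
   lo and hi are expected to lie in [0, 255]. */
static int count_divergent(const uint8_t a[16], int lo, int hi) {
    int n = 0;
    for (int idx = lo; idx <= hi; idx++) {
        if (lane_x86_style(a, (uint8_t)idx) != lane_arm_style(a, (uint8_t)idx)) {
            n++;
        }
    }
    return n;
}
```

With nonzero source lanes, the two models agree on [0, 15] and on [128, 255] and disagree only on [16, 127], so that is the range a fingerprinting probe would use to tell them apart.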

What use cases are there?

Swizzle is quite a common operation, e.g. used in multiple places in meshoptimizer.

Apr 19 '21 20:04 ngzhian

On PPC, vpermr (the vec_perm intrinsic) could be used for this. It actually takes two input vectors (plus the index vector) and only the lower 5 bits are used for each index, but if you pass the same vector for both inputs effectively you get the i % 16 behavior.
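
A scalar sketch of why passing the same vector twice gives the i % 16 behavior: the low 5 bits of each index select from the 32 bytes formed by the two inputs, so with identical inputs the fifth bit makes no difference. The function names are hypothetical, and the model ignores byte ordering and endianness details of the real instruction.

```c
#include <stdint.h>

/* Scalar model of one lane of a two-input permute: only the low 5 bits of
   the index are used, selecting from x (0..15) or y (16..31).
   Names are hypothetical; endianness details are deliberately ignored. */
static uint8_t perm2_lane(const uint8_t x[16], const uint8_t y[16],
                          uint8_t idx) {
    uint8_t k = idx & 0x1F;
    return (k < 16) ? x[k] : y[k - 16];
}

/* Passing the same vector for both inputs reduces this to a[idx % 16]. */
static uint8_t perm_same_lane(const uint8_t a[16], uint8_t idx) {
    return perm2_lane(a, a, idx);
}
```

Note that under this model indices with the top bit set also wrap modulo 16 rather than producing 0.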

On z/Arch there is vperm/vec_perm(), which works the same.

Apr 19 '21 22:04 nemequ

This instruction is straightforward and was used as an example motivator for the relaxed-simd proposal itself. One question that comes to mind though is the mechanism for enabling? I think this has been discussed before but how would we be expected to enable specific instructions to be their relaxed version while others remain unrelaxed?

Apr 27 '21 06:04 jlb6740

This instruction is straightforward and was used as an example motivator for the relaxed-simd proposal itself. One question that comes to mind though is the mechanism for enabling? I think this has been discussed before but how would we be expected to enable specific instructions to be their relaxed version while others remain unrelaxed?

We will not enable an existing instruction to be executed in a relaxed manner. The relaxed instruction will be a completely new instruction with different opcode.

Apr 27 '21 18:04 ngzhian

Yes, so then you could have a module that has both swizzle and relaxed swizzle instructions? What I am wondering then is if I am writing code in C that is auto-vectorized for example, is there expected to be a way to specify this to the compiler that's targeting Wasm?

Apr 27 '21 21:04 jlb6740

Yes, so then you could have a module that has both swizzle and relaxed swizzle instructions?

Yup that is possible.

is there expected to be a way to specify this to the compiler that's targeting Wasm?

Not at the moment. Maybe we can introduce an Emscripten flag to do this, similar to the current -msimd128, that will emit relaxed i8x16.swizzle instead of i8x16.swizzle.

Apr 27 '21 21:04 ngzhian

Yes, a flag makes sense. In fact I imagine with the proper dependence analysis the compiler could figure out if it is safe to use the relaxed version of an instruction. Perhaps that should even be a criterion, or part of the motivation, when proposing a relaxed instruction: that with a compiler flag giving permission and proper analysis, a compiler could determine when it is safe to generate the relaxed version.
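
As a minimal sketch of the simplest case of such an analysis, assume the index vector is a compile-time constant. If every index is in [0, 15], the relaxed and non-relaxed swizzle give the same result by the semantics above. The helper below is hypothetical and does not correspond to any real compiler API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if all 16 constant indices are in range, in which case the
   relaxed and non-relaxed swizzle agree on every lane.
   Hypothetical helper for illustration; not part of any real compiler. */
static bool indices_all_in_range(const uint8_t s[16]) {
    for (int i = 0; i < 16; i++) {
        if (s[i] > 15) {
            return false;   /* this lane would be implementation-defined */
        }
    }
    return true;
}
```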

Apr 27 '21 21:04 jlb6740

In fact I imagine with the proper dependence analysis the compiler could figure out if it is safe to use the relaxed version of an instruction.

Good idea, but likely not possible in the most general case, e.g. if the swizzle depends on a mutable global/imported value.

Apr 27 '21 22:04 ngzhian

if I am writing code in C that is auto-vectorized for example

I don't expect that a compiler would be able to generate either the normal i8x16.swizzle or a relaxed one from auto-vectorized code.

Apr 28 '21 10:04 Maratyszcza

Note: vtbl is not available on ARM v8-M MVE AFAICT.

Nov 01 '21 18:11 ngzhian

RISC-V V has vrgather which returns 0 for out of bounds.

Nov 01 '21 19:11 ngzhian

For Power, this would likely require vperm with a shift left on the selection vector (vperm uses bits 3:7 of each byte of the selection vector); it will then select modulo 16.

Nov 01 '21 23:11 ngzhian
