Extended multiply horizontal add instruction
Introduction
This proposal introduces an extended horizontal multiply and add instruction that is used extensively in colorspace conversion and in the implementation of encoders and decoders for video processing. It mirrors the proposal @Maratyszcza put forth in #127 by adding an additional instruction for u8 -> i16 conversion. It maps to 3 instructions on ARM64 and 4 on ARMv7-A with NEON. It is very similar to pmaddubsw, which is supported on Intel chips, except that where pmaddubsw performs unsigned-by-signed multiplication, this instruction performs unsigned-by-unsigned multiplication.
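To make the intended semantics concrete, here is a minimal scalar sketch (my reading of the lowerings below, with an invented function name): each i16 output lane is the sum of two adjacent unsigned-byte products, with the result wrapping to 16 bits rather than saturating.

```c
#include <stdint.h>

/* Scalar sketch of i16x8.dot_i8x16_u (illustrative only):
   out[i] = x[2i]*y[2i] + x[2i+1]*y[2i+1] on unsigned bytes,
   with the sum wrapping to 16 bits, as in the lowerings below. */
void i16x8_dot_i8x16_u_scalar(const uint8_t x[16], const uint8_t y[16],
                              int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint32_t sum = (uint32_t)x[2 * i] * y[2 * i] +
                       (uint32_t)x[2 * i + 1] * y[2 * i + 1];
        out[i] = (int16_t)(uint16_t)sum; /* wrap to 16 bits */
    }
}
```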
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
- i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:
vmovdqa xmm2, [wasm_splat_i16x8(0x00ff)]
vpand xmm3, xmm_x, xmm2
vpand xmm2, xmm_y, xmm2
vpmullw xmm2, xmm2, xmm3
vpsrlw xmm_x, xmm_x, 8 # if register clobbering is a concern, replace the first operand
vpsrlw xmm_y, xmm_y, 8 # of each of these right shifts with an xmm temporary
vpmullw xmm_out, xmm_x, xmm_y
vpaddw xmm_out, xmm2, xmm_out
x86/x86-64 processors with SSE2 instruction set
- i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:
movdqa xmm3, [wasm_splat_i16x8(0x00ff)]
movdqa xmm2, xmm_x
pand xmm2, xmm3
pand xmm3, xmm_y
pmullw xmm2, xmm3
psrlw xmm_x, 8 # Use movdqa with xmm_x and xmm_y here
psrlw xmm_y, 8 # if it's unsafe to overwrite the input values.
pmullw xmm_x, xmm_y
paddw xmm2, xmm_x
movdqa xmm_x, xmm2
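As a readability aid, here is a hedged C intrinsics sketch of the mask/shift/multiply/add sequence used in both x86 listings above; the function name is invented for illustration. A compiler targeting AVX should emit the three-operand VEX forms, while an SSE2 target inserts the extra register copies shown in the second listing.

```c
#include <emmintrin.h>

/* Intrinsics sketch of the x86 lowering above (illustration only):
   multiply the even (low) and odd (high) bytes of each 16-bit lane
   separately, then add the two partial products. */
static __m128i dot_i8x16_u_x86(__m128i x, __m128i y) {
    const __m128i mask = _mm_set1_epi16(0x00ff);
    __m128i lo = _mm_mullo_epi16(_mm_and_si128(x, mask),
                                 _mm_and_si128(y, mask)); /* even bytes */
    __m128i hi = _mm_mullo_epi16(_mm_srli_epi16(x, 8),
                                 _mm_srli_epi16(y, 8));   /* odd bytes  */
    return _mm_add_epi16(lo, hi);
}
```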
ARM64 processors
- i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:
umull v2.8h, v0.8b, v1.8b
umull2 v0.8h, v0.16b, v1.16b
addp v0.8h, v2.8h, v0.8h
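The same sequence expressed with AArch64 NEON intrinsics, as a sketch (the function name is mine):

```c
#include <arm_neon.h>

/* AArch64 intrinsics sketch of the umull/umull2/addp sequence above. */
static int16x8_t dot_i8x16_u_a64(uint8x16_t x, uint8x16_t y) {
    uint16x8_t lo = vmull_u8(vget_low_u8(x), vget_low_u8(y)); /* umull  */
    uint16x8_t hi = vmull_high_u8(x, y);                      /* umull2 */
    return vreinterpretq_s16_u16(vpaddq_u16(lo, hi));         /* addp   */
}
```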
ARMv7 processors with NEON instruction set
- i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:
vmull.u8 q10, d18, d16 @ low-half products
vmull.u8 q8, d19, d17 @ high-half products
vpadd.i16 d18, d20, d21 @ pairwise add -> result lanes 0..3
vpadd.i16 d19, d16, d17 @ pairwise add -> result lanes 4..7
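And a corresponding sketch with 32-bit NEON intrinsics (again, the function name is illustrative), avoiding the AArch64-only vmull_high_u8 and vpaddq_u16 forms:

```c
#include <arm_neon.h>

/* ARMv7 NEON intrinsics sketch of the vmull/vpadd sequence above. */
static int16x8_t dot_i8x16_u_neon32(uint8x16_t x, uint8x16_t y) {
    uint16x8_t lo = vmull_u8(vget_low_u8(x), vget_low_u8(y));           /* vmull.u8  */
    uint16x8_t hi = vmull_u8(vget_high_u8(x), vget_high_u8(y));         /* vmull.u8  */
    uint16x4_t lo_sum = vpadd_u16(vget_low_u16(lo), vget_high_u16(lo)); /* vpadd.i16 */
    uint16x4_t hi_sum = vpadd_u16(vget_low_u16(hi), vget_high_u16(hi)); /* vpadd.i16 */
    return vreinterpretq_s16_u16(vcombine_u16(lo_sum, hi_sum));
}
```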
The SSSE3 lowering mismatches the others: PMADDUBSW does unsigned-by-signed multiplication.
You're totally right. Nice catch.
@Maratyszcza I think that fixes it... I was stunned during testing by how challenging pmaddubsw is to work with. It treats each operand differently, so unless both operands are guaranteed to be 127 or less, the ordering matters and the results will differ.
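For anyone following along, a small sketch of that asymmetry with the SSSE3 intrinsic (values chosen only to illustrate the point): the first operand of _mm_maddubs_epi16 is read as unsigned bytes and the second as signed bytes, so swapping the operands changes the result once any byte exceeds 127.

```c
#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h>

int main(void) {
    __m128i a = _mm_set1_epi8((char)200); /* byte 0xC8: 200 unsigned, -56 signed */
    __m128i b = _mm_set1_epi8(3);
    int16_t r1[8], r2[8];
    _mm_storeu_si128((__m128i *)r1, _mm_maddubs_epi16(a, b)); /* 200*3 + 200*3 = 1200 */
    _mm_storeu_si128((__m128i *)r2, _mm_maddubs_epi16(b, a)); /* 3*-56 + 3*-56 = -336  */
    printf("%d %d\n", r1[0], r2[0]);
    return 0;
}
```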
By analogy with #127, this instruction should be named i16x8.dot_i8x16_u
Haven't forgotten about this. Will take care of it today.
This proposal is efficient on ARM64, but it isn't efficient on x64. The original objective was to see if pmaddubsw could be implemented portably, since it would have provided an option to do the RGBA conversions on x64 chips in 1-2 ops. Unfortunately, the behavior of pmaddubsw isn't portable, and the workarounds required to get consistent behavior out of it are less efficient than expanding the integer types, multiplying, and adding. As a result, I'd like to withdraw this proposal in favor of the integer sign/zero extension proposal. If someone else has a need for this proposal before standardization, please write back here. For documentation on the issues with pmaddubsw, please see this thread on Stack Overflow.