ecmascript_simd
Support integer mulHigh?
A multiply returning the high 16 bits for int16x8 is supported on Intel (PMULHW). A potential SIMD.js customer interested in codecs ran into the lack of this operation. What do folks think of adding it?
@sunfishcode
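For concreteness, here is a minimal scalar sketch of the per-lane operation being asked for, assuming signed 16-bit lanes as in int16x8 (the helper name is illustrative only):

```c
#include <stdint.h>

/* Per-lane semantics of a "multiply, keep the high 16 bits" operation:
 * widen to 32 bits, multiply, and take the top half of the product. */
static inline int16_t mul_high_i16(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}
```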
My big-picture view is that the current iteration of SIMD.js is primarily about delivering basic functionality and establishing how SIMD integrates with JS in general, since that's a fairly complex undertaking in itself, and that there will be many iterations of the spec in the future where we can add all the other features we want.
That said, this would be a useful feature to add in such an iteration. What are the implementation prospects on NEON?
NEON (AArch64 and AArch32) provides signed and unsigned 'long' vector multiply instructions. For example, the SMULL instruction multiplies two int16x4 vectors to produce an int32x4 result; it operates on the low 64 bits of the input vectors. There is a corresponding SMULL2 instruction which operates on the high 64 bits of the inputs. There are instructions for {Int,Uint}{8,16,32} input lanes.
It is not clear that cloning the PMULHW / PMULHUW instructions is the best way of providing long multiplication support. It would be interesting to compare real code using SSE-style long multiplication vs NEON-style.
Mul-high operations are often used in DSP without the need to capture the low part of the result, which means you can capture twice as many results in a single operation. ARM NEON offers VQDMULH and VQRDMULH, which differ from PMULHW in that they drop the bottom 15 bits rather than 16 bits. PMULHW seems to be unsigned 0.16 fixed-point arithmetic, while VQDMULH is signed 1.15 fixed-point arithmetic.
Best practice on NEON depends on what the final goal is, but implementing PMULHW blindly would require a temporary register, two VMULLs, and a VUZP (and at the end of that, the temporary register would contain the corresponding low parts of the results).
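A rough sketch of that emulation with NEON intrinsics, assuming signed 16-bit lanes and a little-endian target (the function name is made up):

```c
#include <arm_neon.h>

/* Emulate an SSE-style "multiply, keep the high 16 bits" on NEON using
 * two long multiplies and an unzip, as described above. */
static inline int16x8_t mul_high_i16x8(int16x8_t a, int16x8_t b) {
    int32x4_t lo = vmull_s16(vget_low_s16(a),  vget_low_s16(b));   /* VMULL, lanes 0-3 */
    int32x4_t hi = vmull_s16(vget_high_s16(a), vget_high_s16(b));  /* VMULL, lanes 4-7 */
    /* Deinterleave the 16-bit halves of the 32-bit products (VUZP):
     * val[0] holds the low halves, val[1] the high halves we want. */
    int16x8x2_t uz = vuzpq_s16(vreinterpretq_s16_s32(lo),
                               vreinterpretq_s16_s32(hi));
    return uz.val[1];
}
```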
Hi All,
Thanks for the replies.
I'm not necessarily trying to change the scope of the current version, but I figure talking through integer use cases is relevant to Stage 3 "developer feedback". That said, I can definitely see the case for pushing this to a later version.
It has been illuminating talking to a customer in the image/video space (I'll point James Bankoski at this thread to follow along), as it's made me want to double-check that we have enough coverage of what DSP folks typically use to be useful out of the gate. James's application currently sees a slowdown in both Firefox and Chrome with SIMD.js on. Note that this is just leaning on Emscripten + SIMD macros; nothing has been hand-tuned yet.
To get a better handle on what's happening in his app, I fished out some of the DSP kernels from James's application (WebP) to understand how mulHigh-type operations are used. It seems like in portable code of this sort, the difference between PMULHW and VQDMULH is avoided by careful selection of the range of the inputs, sidestepping the difference in shift and overflow behavior. The code also tends to multiply by constants. I'm not sure there's a general pattern that can easily be shared between the two arches, but if we can find one, it might be nice.
James is working on WebP; source code is here: https://chromium.googlesource.com/webm/libwebp/+/master
He's been experimenting with directly applying it via emscripten: https://chromium-review.googlesource.com/#/c/43081/
There are two main locations I've found that use PMULHW / VQDMULH (in code that has SSE + NEON versions):
src/dsp/lossless_enc*.c
I actually started down the path of posting a comparison of the TransformColor function from lossless_enc*.c plus a SIMD.js version, only to realize that with a different approach it can probably be expressed without mulHigh. It scales a bunch of RGB values, doing some 8-bit x 8-bit multiplies (high word preserving) followed by 16-bit adds/subs of the results. Its NEON and SSE versions end up obscuring the multiply, making it look like more than 16 bits of result are used, but I've convinced myself this one could be done OK with SIMD.js as is.
src/dsp/dec_*.c + src/dsp/enc_*.c
These two have transforms that are mirror images of each other. They use a bunch of multiplications by 1.15 fixed-point constants. For SSE, they use the unsigned 0.16 multiply of PMULHW by computing x * K as (x * (K - 1.0)) + x, since they happen to have constants > 1. A possible pattern to be gleaned here is that for some values we can produce good code on both arches by picking a different value to multiply by (and requiring a range that doesn't overflow).
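A hedged scalar sketch of that trick, with a purely illustrative fractional constant standing in for (K - 1.0) scaled to 0.16 format (the real kernels use their own constants):

```c
#include <stdint.h>

/* Multiply x by a fixed-point constant K with 1.0 < K < 2.0 using only a
 * "high 16 bits of an unsigned 16x16 product" primitive (PMULHUW-style):
 * split K as 1.0 + frac and compute  x * K  as  mulhi(x, frac) + x.
 * The caller must keep x in a range where the result doesn't overflow. */
static inline uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

static inline int32_t mul_by_k(uint16_t x) {
    const uint16_t kFrac = 0x4000;  /* illustrative: (K - 1.0) = 0.25 in 0.16 format */
    return (int32_t)mulhi_u16(x, kFrac) + x;  /* approximately x * 1.25 */
}
```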
Possible strawman: a fixed-point multiply limited to a constant shift + multiplication value?

SIMD.Int16x8.fixedPointConstMul(x, Av, B) --> ((int32x8) x * Av) >> B

Av (int16x8) and B must be constant to get performance (like swizzle/shuffle). Av + B must have values that avoid overflow. If we want to rule out all the performance dead spots on all arches, we might have to limit the range of Av and B a little more, but I haven't quite thought through the constraint.
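A scalar reference for the strawman's per-lane behaviour, assuming the widening + constant-shift reading above (the function name and array shapes are illustrative, not a proposed API):

```c
#include <stdint.h>

/* fixedPointConstMul per-lane reference: widen each int16 lane to 32 bits,
 * multiply by the corresponding constant lane of Av, then arithmetic-shift
 * right by the constant B and narrow back to 16 bits. */
static void fixed_point_const_mul(int16_t out[8], const int16_t x[8],
                                  const int16_t Av[8], int B) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)(((int32_t)x[i] * Av[i]) >> B);
    }
}
```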
What do folks think?
Does this seem at all in scope for v1?
It seems out of scope for v1 to me.
Some people close to TC39 have encouraged me that future iterations of SIMD.js should be relatively simple and quick to standardize, because v1 set up the major supporting infrastructure. Consequently, we shouldn't feel under pressure to rush features in.
This seems like a good candidate for a .Universe operation (see: https://github.com/tc39/ecmascript_simd/blob/master/extended-api.md)