ecmascript_simd icon indicating copy to clipboard operation
ecmascript_simd copied to clipboard

Support integer mulHigh ?

Open flagxor opened this issue 8 years ago • 7 comments

Multiply returning the high 16-bits for int16x8 is supported by Intel (PMULHW). A potential SIMD.js customer interested in codecs came across the lack of this. What do folks think of adding it?

flagxor avatar Mar 07 '16 06:03 flagxor

@sunfishcode

johnmccutchan avatar Mar 07 '16 14:03 johnmccutchan

My big-picture view is that the current iteration of SIMD.js is primarily about delivering basic functionality and establishing how SIMD integrates with JS in general, since that's a fairly complex undertaking in itself, and that there will be many iterations of the spec in the future where we can add all the other features we want.

That said, this would be a useful feature to add in such an iteration. What are the implementation prospects on NEON?

sunfishcode avatar Mar 07 '16 16:03 sunfishcode

NEON (Aarch64 and Aarch32) provides signed and unsigned 'long' vector multiply instructions.

For example, the SMULL instruction multiplies two int16x4 vectors to produce a int32x4 result. It operates on the low 64 bits of the input vectors. There is a corresponding SMULL2 instruction which operates on the high 64 bit of the inputs.

There are instructions for {Int,Uint}{8,16,32} input lanes.

It is not clear that cloning the PMULHW / PMULHUW instructions is the best way of providing long multiplication support. It would be interesting to compare real code using SSE-style long multiplication vs NEON-style.

stoklund avatar Mar 07 '16 18:03 stoklund

Mul-high operations are often used in DSP without the need to capture the low part of the result, which means you can capture twice as many results in a single operation. ARM NEON offers VQDMULH and VQRDMULH, which differ from PMULHW in that they drop the bottom 15 bits rather than 16 bits. PMULHW seems to be unsigned 0.16 fixed-point arithmetic, while VQDMULH is signed 1.15 fixed-point arithmetic.

Best practice on NEON depends on what the final goal is; but to implement PMULHW blindly would require a temporary register, two VMULLs, and a VUZP (and at the end of that the temporary register would contain the corresponding low parts of the results).

ghost avatar Mar 07 '16 21:03 ghost

Hi All,

Thanks for the replies.

I'm not necessarily trying to change scope of the current version, but figure talking through integer use cases is relevant to Stage 3 "developer feedback". Though I can definitely see the case for pushing this to a later version.

It has been illuminating talking to a customer in the image/video space (will point James Bankoski at this thread to follow along), as it's made me want to double check that we have enough coverage of what DSP folks typically use to be useful out of the gate. James' application currently sees a slowdown in both firefox and chrome with SIMD.js on. Note, this is just leaning on emscripten + simd macros, nothing's been hand tuned yet.

To get a better handle on what's happening in his app, I fished out some of the DSP kernels from James' application (webp) to understand how mulHigh type operations are used. It seems like in portable code of this sort, the difference between PMULHW and VQDMULH is avoided by careful selection of the range of the inputs to avoid the difference in shift and overflow behavior. Also it tends to involve multiplying by constants. I'm not sure if there's a general pattern that can easily made to be shared between the two arches. If we can find one, it might be nice.

James is working on webp, source code here: https://chromium.googlesource.com/webm/libwebp/+/master

He's been experimenting with directly applying it via emscripten: https://chromium-review.googlesource.com/#/c/43081/

There are 2 main locations I've found that use PMULHW / VQDMULH (in code that has sse + neon versions):

src/dsp/lossless_enc*.c

I actually started down the path of posting a comparison of the TransformColor function from lossless_enc*.c + a SIMD.js version, only to realize that with a different approach, it can probably be expressed without mulHigh. It scales a bunch of RGB values doing some 8-bit x 8-bit multiplies (high word preserving), followed by 16-bit adds/subs of the results. It's neon and sse versions end up obsuring the multiply, making them look like more than 16-bits of result are used, but I've convinced myself this one could be done ok with SIMD.js as is.

src/dsp/dec_.c + src/dsp/enc_.c

These two have transforms that are the mirror image of each other. They use a bunch of multiplication by 1.15 fixed point constants. For SSE, they use the unsigned 0.16 multiply of PMULHW by doing: x * K as (x * (K - 1.0)) + x as they happen to have constants >1. A possible pattern to be gleaned here, is that for some values we can produce good code on both arches by picking a different value to multiply by (and requiring a range that doesn't overflow).

Possible strawman. A fixed point multiply limited to constant shift + multiplication value?

SIMD.Int16x8.fixedPointConstMul(x, Av, B) --> ((int32x8) x * Av) >> B Av(int16x8) and B must be constant to get performance (like swizzle/shuffle). Av + B must have values that avoid overflow. If we want to rule out all the performance deadspots on all arches, we might have to limit the range of Av and B a little more, but I haven't quite thought through the constraint.

What do folks think?

Does this seem at all in scope for v1?

flagxor avatar Mar 13 '16 21:03 flagxor

It seems out of scope for v1 to me.

Some people close to TC39 have encouraged me that future iterations of SIMD.js should relatively simple and quick to standardize, because v1 set up the major supporting infrastructure. Consequently, we shouldn't feel under pressure to rush features in.

sunfishcode avatar Mar 14 '16 18:03 sunfishcode

This seems like a good candidate for a .Universe operation (see: https://github.com/tc39/ecmascript_simd/blob/master/extended-api.md)

PeterJensen avatar Mar 15 '16 18:03 PeterJensen