design 64-bit Widening Multiplication

Currently, wasm has no instructions that compute the high bits of a 64-bit widening multiplication. Would it be possible to add them in the future?
For your information, LLVM emits a very slow routine to emulate it, and it seems unlikely that wasm JIT and AOT compilers can easily optimize that routine.

Jan 19 '24 20:01 waterlens

Do you mean the equivalent of the Arm64 umulh instruction? This would be a straightforward addition to the spec, because IIRC on x64 this would be just the regular mul instruction with a REX prefix. This is also potentially an optimization opportunity, because doing the equivalent of a * (unsigned __int128)b >> 64 doesn't seem to generate efficient bytecode in the LLVM backend.
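For reference, the operation under discussion can be written as a minimal Rust sketch. The function names `umulh` and `smulh` are borrowed from the Arm64 mnemonics purely for illustration; they are not existing wasm instructions or library functions. The sketch relies on Rust's built-in 128-bit integers, so on wasm targets it is presumably subject to the same slow lowering described above.

```rust
/// High 64 bits of the unsigned 64x64 -> 128-bit product.
/// (Illustrative name, mirroring the Arm64 `umulh` mnemonic.)
pub fn umulh(a: u64, b: u64) -> u64 {
    // Widen both operands, multiply, and keep only the upper half.
    ((a as u128 * b as u128) >> 64) as u64
}

/// High 64 bits of the signed 64x64 -> 128-bit product.
/// (Illustrative name, mirroring the Arm64 `smulh` mnemonic.)
pub fn smulh(a: i64, b: i64) -> i64 {
    // The shift on i128 is arithmetic, so the sign is preserved.
    ((a as i128 * b as i128) >> 64) as i64
}
```

On a native 64-bit target a compiler can usually turn each of these into a single multiply instruction; the concern in this thread is that wasm has no instruction to map them to.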

Jan 19 '24 21:01 dtig

@dtig Thank you for your comment!

Yes, umulh is exactly one of the instructions I proposed.

For the architecture support, I can give a simple summary here.

RV64M: mulh, mulhsu, mulhu
x86_64: mul, imul, mulx (with BMI2 extension)
ARMv8: umulh, smulh

However, it's rare to provide such instructions on 32-bit platforms.
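To illustrate what a target without such an instruction would have to do, here is a rough sketch of the high half computed from four 32x32 -> 64 partial products. The name `umulh_limbs` is hypothetical, and the sketch assumes the target at least has a 32x32 -> 64 multiply; it is not taken from any particular runtime.

```rust
/// High 64 bits of a 64x64 product, built only from 32x32 -> 64 multiplies.
/// (Hypothetical helper for illustration.)
pub fn umulh_limbs(a: u64, b: u64) -> u64 {
    // Split each operand into 32-bit limbs.
    let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
    let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);

    // Four partial products; each fits in 64 bits.
    let lo_lo = a_lo * b_lo;
    let hi_lo = a_hi * b_lo;
    let lo_hi = a_lo * b_hi;
    let hi_hi = a_hi * b_hi;

    // Sum the column at bit 32 to find the carry into the high half.
    let mid = (lo_lo >> 32) + (hi_lo & 0xFFFF_FFFF) + (lo_hi & 0xFFFF_FFFF);

    // High half: top product, upper halves of the cross terms, and the carry.
    hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32)
}
```

That is four multiplies plus carry handling for something a 64-bit machine does in one instruction, which is roughly the cost gap being weighed here.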

About adding them to the spec, there are several considerations.

Usefulness: Positive. 64-bit widening multiplication is very common in numeric and cryptographic code.
Architecture support: Neutral. It is straightforward to support on 64-bit architectures but hard on 32-bit ones.
Necessity: Positive. Currently, LLVM emits a __multi3 call to emulate a 128-bit multiplication for 64-bit widening multiplication, and wasm compilers such as Cranelift have trouble optimizing it away; see bytecodealliance/wasmtime#4077. First, for 64-bit widening multiplication we only expect i64*i64 -> i128, not i128*i128 -> i128 (__multi3). It emulates the calculation of not only the high 64 bits but also the low 64 bits, which is inefficient. In fact, i128*i128 -> i128 can be implemented using one i64*i64 -> i128 and two i64*i64 -> i64, and I have no idea why they don't directly emit the i64*i64 -> i128. This is an LLVM problem, but it is never a real concern on native 64-bit targets. It is one for wasm, however. Second, wasm requires the function to return the i128 result through linear memory, which adds an extra layer and hides information that optimizers need. In some cases the function lives in a standalone library, which makes things worse because the routine is unlikely to be inlined.

Considering all of this, I suggest adding them to the wasm spec as an extension. Some implementations could be expected to implement it happily, while others could ignore the extension.

Jan 20 '24 07:01 waterlens

Shall we draft a proposal for it?

Jan 20 '24 08:01 waterlens

Sorry, I lost track of this issue. I think drafting a proposal is a good next step, though it might also be a good question for the CG or the SIMD subgroup to see whether there are other operations we can include: we don't have a precedent for introducing one-off operations, but there's probably a class of operations we could consider. Here is some documentation on how to get a proposal started, and in parallel this could also be a good topic for discussion at a future SIMD subgroup meeting. @penzn are the subgroup meetings still running?

Mar 13 '24 21:03 dtig

Thank you for your time. Yeah, it would be better to discuss it somewhere. I would like to hear opinions from other people. I saw that some people misunderstood this as an attempt to add the instructions directly to the core instruction set, but that's not correct. I'm proposing this as an extension, because 64-bit widening multiplication might be hard to do on some platforms.

Mar 19 '24 17:03 waterlens