64-bit Widening Multiplication
Currently, wasm has no instructions for calculating the high bits of a 64-bit widening multiplication. Would it be possible to add them in the future?
For your information, LLVM emits a very slow routine to emulate it, and it seems unlikely that wasm JIT and AOT compilers can easily optimize that routine away.
Do you mean an equivalent of the Arm64 `umulh` instruction? This would be a straightforward addition to the spec, because IIRC on x64 this would just be the regular `mul` instruction with a REX prefix. This is also potentially an optimization opportunity, because doing the equivalent of `a * (unsigned __int128)b >> 64` doesn't seem to generate efficient bytecode in the LLVM backend.
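For reference, the C idiom mentioned above can be written out as follows. This is only a minimal sketch: it assumes a compiler with the `__int128` extension (GCC or Clang), and the helper names `mulhi_u64` and `mulhi_s64` are made up for illustration.

```c
#include <stdint.h>

/* High 64 bits of the unsigned 64x64 -> 128 product
 * (the value an instruction like umulh produces).
 * Assumes the GCC/Clang __int128 extension. */
static inline uint64_t mulhi_u64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* High 64 bits of the signed 64x64 -> 128 product
 * (the value an instruction like smulh produces).
 * Relies on GCC/Clang performing an arithmetic right shift
 * on negative signed values, which ISO C leaves
 * implementation-defined. */
static inline int64_t mulhi_s64(int64_t a, int64_t b) {
    return (int64_t)(((__int128)a * b) >> 64);
}
```

On native 64-bit targets, compilers typically lower these helpers to a single multiply instruction; the concern in this thread is what happens when the target is wasm.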
@dtig Thank you for your comment!
Yes, umulh is exactly one of the instructions I proposed.
For the architecture support, I can give a simple summary here.
- RV64M: `mulh`, `mulhsu`, `mulhu`
- x86_64: `mul`, `imul`, `mulx` (with the BMI2 extension)
- ARMv8: `umulh`, `smulh`
However, such instructions are rarely provided on 32-bit platforms.
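To give a feel for what a target without such an instruction has to do, here is a rough sketch of the usual schoolbook fallback, which builds the high half out of four 32x32 -> 64 multiplications. The function name `mulhi_u64_portable` is hypothetical, and this is not claimed to be what any particular compiler or engine emits.

```c
#include <stdint.h>

/* High 64 bits of a 64x64 -> 128 unsigned product, computed
 * from 32-bit halves only (no 128-bit type required). */
uint64_t mulhi_u64_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* Four partial products, each fits in 64 bits. */
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum of the middle column (bits 32..95 of the product).
     * The three terms cannot overflow 64 bits:
     * (2^32 - 1) + (2^32 - 1) + (2^32 - 1)^2 = 2^64 - 1. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    /* Carry the upper parts of the middle terms into the high half. */
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
```

Compared with a single `umulh` or `mul`, this takes four multiplications plus several shifts and additions, and the cost grows further when the 64-bit operations themselves have to be split on a 32-bit machine.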
About adding them to the spec, there are several considerations.
- Usefulness: Positive. 64-bit widening multiplication is very common in numerical and cryptographic code.
- Architecture support: Neutral. It is straightforward to support on 64-bit architectures but hard on 32-bit ones.
- Necessity: Positive. Currently, LLVM emits a `__multi3` call to emulate the 128-bit multiplication for a 64-bit widening multiplication, and wasm compilers like `cranelift` have trouble optimizing it away, per bytecodealliance/wasmtime#4077.
  First, for a 64-bit widening multiplication we only need `i64 * i64 -> i128`, not `i128 * i128 -> i128` (`__multi3`). The call emulates the calculation of not only the high 64 bits but also the low 64 bits, which is inefficient. In fact, `i128 * i128 -> i128` can be implemented using one `i64 * i64 -> i128` and two `i64 * i64 -> i64`, and I have no idea why LLVM doesn't emit the `i64 * i64 -> i128` directly. This is an LLVM issue, but it is never a real concern on native 64-bit targets. It is one for wasm, however.
  Second, on wasm the function has to return the `i128` result through linear memory, which adds an extra layer of indirection and hides information the optimizer needs. In some cases the function lives in a standalone library, which makes things worse because the routine is unlikely to be inlined.
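The decomposition mentioned under Necessity (one `i64 * i64 -> i128` plus two `i64 * i64 -> i64`) can be sketched like this. The `u128_parts` struct and both function names are hypothetical, `__int128` is again assumed to be available (GCC or Clang), and this is not meant to mirror the actual `__multi3` source.

```c
#include <stdint.h>

/* Hypothetical two-limb representation of a 128-bit value. */
typedef struct { uint64_t lo, hi; } u128_parts;

/* The primitive this thread asks for: i64 * i64 -> i128.
 * Written with __int128 here; a wasm instruction would
 * produce the two halves directly. */
static inline u128_parts widening_mul_u64(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;
    return (u128_parts){ (uint64_t)p, (uint64_t)(p >> 64) };
}

/* Truncating i128 * i128 -> i128 built from one widening
 * multiply and two ordinary 64-bit multiplies:
 *   (a.hi*2^64 + a.lo) * (b.hi*2^64 + b.lo) mod 2^128
 *     = a.lo*b.lo + 2^64 * (a.lo*b.hi + a.hi*b.lo) mod 2^128 */
static inline u128_parts mul_u128(u128_parts a, u128_parts b) {
    u128_parts r = widening_mul_u64(a.lo, b.lo);
    r.hi += a.lo * b.hi + a.hi * b.lo; /* wraps mod 2^64 by design */
    return r;
}
```

When both operands are zero-extended 64-bit values (`hi == 0`), the two extra multiplies contribute nothing, which is why emitting the full 128-bit routine for a plain widening multiplication is wasted work.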
Considering all of this, I suggest adding them to the wasm spec as an extension. Some implementations would presumably be happy to implement it, while others could ignore the extension.
Shall we draft a proposal for it?
Sorry, I lost track of this issue. I think drafting a proposal is a good next step, though it might also be a good question for the CG or the SIMD subgroup to see if there are other operations we can include: we don't have a precedent for introducing one-off operations, but there's probably a class of operations we could consider. Here is some documentation on how to get a proposal started, and in parallel this could also be a good topic for discussion at a future SIMD subgroup meeting. @penzn are the subgroup meetings still running?
Thank you for your time. Yeah, it would be better to discuss it somewhere. I would like to hear opinions from other people. I saw that some people misunderstood this as trying to add the instructions directly to the core instruction set, but that's not correct. I'm actually proposing this as an extension, because 64-bit widening multiplication might be hard to do on some platforms.