binaryen
binaryen copied to clipboard
[OptimizeInstructions] Optimize copysign operations
copysign(x, +C) -> abs(x);
copysign(x, -C) -> neg(abs(x));
copysign(x, x) -> x
I think the first and last rules make sense, but as discussed in the other issue, the one that generates two operations from one seems risky. We'd need to confirm that all major VMs already optimize those two operations together. It might be a good idea to leave it as a TODO for now if we are not sure.
Here wasm dumps for v8 (turbofun) for copysign(x, -C)
and neg(abs(x))
:
index: 0
kind: wasm function
compiler: TurboFan
Body (size = 128 = 80 + 48 padding)
Instructions (size = 72)
0x13277680680 0 55 push rbp
0x13277680681 1 4889e5 REX.W movq rbp,rsp
0x13277680684 4 6a08 push 0x8
0x13277680686 6 56 push rsi
0x13277680687 7 49ba00000000000090c0 REX.W movq r10,0xc090000000000000
0x13277680691 11 c4c1f96ec2 vmovq xmm0,r10
0x13277680696 16 c4e1f97ec0 vmovq rax,xmm0
0x1327768069b 1b c4e1f97ecb vmovq rbx,xmm1
0x132776806a0 20 48ba0000000000000080 REX.W movq rdx,0x8000000000000000
0x132776806aa 2a 4823d0 REX.W andq rdx,rax
0x132776806ad 2d 48b8ffffffffffffff7f REX.W movq rax,0x7fffffffffffffff
0x132776806b7 37 4823c3 REX.W andq rax,rbx
0x132776806ba 3a 480bd0 REX.W orq rdx,rax
0x132776806bd 3d c4e1f96eca vmovq xmm1,rdx
0x132776806c2 42 488be5 REX.W movq rsp,rbp
0x132776806c5 45 5d pop rbp
0x132776806c6 46 c3 retl
0x132776806c7 47 90 nop
Safepoints (entries = 0, byte size = 8)
RelocInfo (size = 0)
--- End code ---
--- WebAssembly code ---
name: wasm-function[1]
index: 1
kind: wasm function
compiler: TurboFan
Body (size = 64 = 52 + 12 padding)
Instructions (size = 44)
0x13277680700 0 55 push rbp
0x13277680701 1 4889e5 REX.W movq rbp,rsp
0x13277680704 4 6a08 push 0x8
0x13277680706 6 56 push rsi
0x13277680707 7 49ba306e2d0c01000000 REX.W movq r10,0x10c2d6e30 ;; external reference (double_absolute_constant)
0x13277680711 11 c4c1705402 vandps xmm0,xmm1,[r10]
0x13277680716 16 49ba406e2d0c01000000 REX.W movq r10,0x10c2d6e40 ;; external reference (double_negate_constant)
0x13277680720 20 c4c178570a vxorps xmm1,xmm0,[r10]
0x13277680725 25 488be5 REX.W movq rsp,rbp
0x13277680728 28 5d pop rbp
0x13277680729 29 c3 retl
0x1327768072a 2a 90 nop
0x1327768072b 2b 90 nop
As you can see even without optimization to single command it much smaller (in 2.5x) and faster (see second func wasm-function[1]
).
Sample: copysign_vs_nabs.wasm.zip
@kripken Apparently, it doesn't matter if engines can optimize nabs to a single instruction, it's still faster as we can see here: https://github.com/WebAssembly/binaryen/pull/5032#issuecomment-1244069071
Interesting, thanks for the info @MaxGraey
@jakobkummerow , perhaps you can advise here? The CPU details above are beyond my knowledge. The question is whether Binaryen should do this rule:
copysign(x, -C) -> neg(abs(x)); (where C is a constant)
That turns one instruction into two, which worries me a little, but @MaxGraey has some data above that suggests it might be ok.
As the assembly dumps show, the number of Wasm instructions is a very weak predictor for the amount of machine code that'll be generated.
Performance-wise, I have no concerns about applying this optimization in Binaryen.
I also don't think it'll matter much; but then maybe I just haven't seen the cases where it does matter.
I also think that it would be quite feasible to apply the same optimization in V8, if we wanted to (e.g. because we encountered a case where it mattered). If we did that, then I think copysign(x, -C)
(understood as "setsign(x, negative)") might be optimizable to a slightly better code sequence than neg(abs(x))
, but again it's unclear to me whether the difference would matter.
My biggest concern would be verifying that the respective instruction sequences are indeed indentical for the corner cases (NaNs, infinities, and negative zero used as either input).
It makes sense due to neg(abs(x))
could be optimized to single instruction on most architectures. See this: https://github.com/bytecodealliance/wasmtime/issues/4803
Also on x64 and AVX-512 copysign
could be lowered to single inst as well:
mov 0x7fffffff, %r10d
vmovd %r10d, %xmm3
vpternlogd $0xE4, %xmm3, %xmm1, %xmm0
Ah right, good point that an engine could fold neg(abs(x))
just as easily as it could fold copysign(x, -C)
.
@kripken Since we found out that even in the current situation with wasm engines we get a more optimal codogen in any way, can we merge that?