binaryen [OptimizeInstructions] Optimize copysign operations

copysign(x, +C)  ->  abs(x);
copysign(x, -C)  ->  neg(abs(x));
copysign(x,  x)  ->  x

Sep 10 '22 10:09 MaxGraey

I think the first and last rules make sense, but as discussed in the other issue, the one that generates two operations from one seems risky. We'd need to confirm that all major VMs already optimize those two operations together. It might be a good idea to leave it as a TODO for now if we are not sure.

Sep 12 '22 17:09 kripken

Here wasm dumps for v8 (turbofun) for copysign(x, -C) and neg(abs(x)):

index: 0
kind: wasm function
compiler: TurboFan
Body (size = 128 = 80 + 48 padding)
Instructions (size = 72)
0x13277680680     0  55                   push rbp
0x13277680681     1  4889e5               REX.W movq rbp,rsp
0x13277680684     4  6a08                 push 0x8
0x13277680686     6  56                   push rsi

0x13277680687     7  49ba00000000000090c0 REX.W movq r10,0xc090000000000000
0x13277680691    11  c4c1f96ec2           vmovq xmm0,r10
0x13277680696    16  c4e1f97ec0           vmovq rax,xmm0
0x1327768069b    1b  c4e1f97ecb           vmovq rbx,xmm1
0x132776806a0    20  48ba0000000000000080 REX.W movq rdx,0x8000000000000000
0x132776806aa    2a  4823d0               REX.W andq rdx,rax
0x132776806ad    2d  48b8ffffffffffffff7f REX.W movq rax,0x7fffffffffffffff
0x132776806b7    37  4823c3               REX.W andq rax,rbx
0x132776806ba    3a  480bd0               REX.W orq rdx,rax
0x132776806bd    3d  c4e1f96eca           vmovq xmm1,rdx

0x132776806c2    42  488be5               REX.W movq rsp,rbp
0x132776806c5    45  5d                   pop rbp
0x132776806c6    46  c3                   retl
0x132776806c7    47  90                   nop

Safepoints (entries = 0, byte size = 8)

RelocInfo (size = 0)

--- End code ---

--- WebAssembly code ---
name: wasm-function[1]
index: 1
kind: wasm function
compiler: TurboFan
Body (size = 64 = 52 + 12 padding)
Instructions (size = 44)
0x13277680700     0  55                   push rbp
0x13277680701     1  4889e5               REX.W movq rbp,rsp
0x13277680704     4  6a08                 push 0x8
0x13277680706     6  56                   push rsi

0x13277680707     7  49ba306e2d0c01000000 REX.W movq r10,0x10c2d6e30    ;; external reference (double_absolute_constant)
0x13277680711    11  c4c1705402           vandps xmm0,xmm1,[r10]
0x13277680716    16  49ba406e2d0c01000000 REX.W movq r10,0x10c2d6e40    ;; external reference (double_negate_constant)
0x13277680720    20  c4c178570a           vxorps xmm1,xmm0,[r10]

0x13277680725    25  488be5               REX.W movq rsp,rbp
0x13277680728    28  5d                   pop rbp
0x13277680729    29  c3                   retl
0x1327768072a    2a  90                   nop
0x1327768072b    2b  90                   nop

As you can see even without optimization to single command it much smaller (in 2.5x) and faster (see second func wasm-function[1]).

Sample: copysign_vs_nabs.wasm.zip

Sep 12 '22 17:09 MaxGraey

@kripken Apparently, it doesn't matter if engines can optimize nabs to a single instruction, it's still faster as we can see here: https://github.com/WebAssembly/binaryen/pull/5032#issuecomment-1244069071

Sep 13 '22 15:09 MaxGraey

Interesting, thanks for the info @MaxGraey

@jakobkummerow , perhaps you can advise here? The CPU details above are beyond my knowledge. The question is whether Binaryen should do this rule:

copysign(x, -C)  ->  neg(abs(x));       (where C is a constant)

That turns one instruction into two, which worries me a little, but @MaxGraey has some data above that suggests it might be ok.

Sep 13 '22 17:09 kripken

As the assembly dumps show, the number of Wasm instructions is a very weak predictor for the amount of machine code that'll be generated.

Performance-wise, I have no concerns about applying this optimization in Binaryen. I also don't think it'll matter much; but then maybe I just haven't seen the cases where it does matter. I also think that it would be quite feasible to apply the same optimization in V8, if we wanted to (e.g. because we encountered a case where it mattered). If we did that, then I think copysign(x, -C) (understood as "setsign(x, negative)") might be optimizable to a slightly better code sequence than neg(abs(x)), but again it's unclear to me whether the difference would matter.

My biggest concern would be verifying that the respective instruction sequences are indeed indentical for the corner cases (NaNs, infinities, and negative zero used as either input).

Sep 15 '22 11:09 jakobkummerow

It makes sense due to neg(abs(x)) could be optimized to single instruction on most architectures. See this: https://github.com/bytecodealliance/wasmtime/issues/4803

Also on x64 and AVX-512 copysign could be lowered to single inst as well:

mov 0x7fffffff, %r10d
vmovd %r10d, %xmm3
vpternlogd $0xE4, %xmm3, %xmm1, %xmm0

Sep 15 '22 11:09 MaxGraey

Ah right, good point that an engine could fold neg(abs(x)) just as easily as it could fold copysign(x, -C).

Sep 15 '22 18:09 jakobkummerow

@kripken Since we found out that even in the current situation with wasm engines we get a more optimal codogen in any way, can we merge that?

Sep 19 '22 20:09 MaxGraey

binaryen binaryen copied to clipboard

[OptimizeInstructions] Optimize copysign operations

binaryen
binaryen copied to clipboard