CSE for LEA-like operations
Currently binaryen can't do CSE for cases like these:
i32.load(p + ((i + 0) << 2))
i32.load(p + ((i + 1) << 2))
i32.load(p + ((i + 2) << 2))
These usually appear when we have array accesses like:
p[i + 0]
p[i + 1]
p[i + 2]
I'm wondering how best to teach binaryen to handle such cases and optimize them to this:
let pi = p + (i << 2)
i32.load(pi + (0 << 2))
i32.load(pi + (1 << 2))
i32.load(pi + (2 << 2))
Probably the best approach is to always distribute (x op1 C1) op2 C2 to ((x op2 C2) op1 (C1 op2 C2)) in OptimizeInstructions? WDYT?
Probably the best approach is to always distribute (x op1 C1) op2 C2 to ((x op2 C2) op1 (C1 op2 C2)) in OptimizeInstructions?
That would increase code size, though.
I'm not sure how best to handle this. But at least the p + .. part could be handled by LocalCSE, though at the moment it starts from the leaves, so any difference stops it. It would need to scan from the parents and allow the children to differ.
That would increase code size, though.
Yes, that may increase size in the general case. But for load/store it may actually decrease size. For example:
i32.load(p + ((i + 127) << 2))
output (17 bytes):
local.get $ptr
local.get $idx
i32.const 127
i32.add
i32.const 2
i32.shl
i32.add
i32.load
Optimized:
i32.load(p + (i << 2) + (127 << 2))
output (14 bytes):
local.get $ptr
local.get $idx
i32.const 2
i32.shl
i32.add
i32.load offset=508
That example ends up optimized well because one of the addends is a constant. We do have handling for that already,
https://github.com/WebAssembly/binaryen/blob/2ae1a61ae4d3b23e5f7feaf4c7daf0ce72ffbb92/src/passes/OptimizeInstructions.cpp#L2470-L2473
(but if you find it doesn't work on that example perhaps it has a bug).
Yeah, it seems it doesn't handle this properly. Here's an example:
(module
(memory $0 0)
(export "test" (func $test))
(export "memory" (memory $0))
(func $test (param $0 i32) (param $1 i32) (result i32)
local.get $0
local.get $1
i32.const 1
i32.add
i32.const 2
i32.shl
i32.add
i32.load
local.get $0
local.get $1
i32.const 2
i32.add
i32.const 2
i32.shl
i32.add
i32.load
i32.add
)
)
After wasm-opt load_lea.wasm -O4 -o load_lea.opt.wasm I got the same output:
(module
(type $t0 (func (param i32 i32) (result i32)))
(func $test (export "test") (type $t0) (param $p0 i32) (param $p1 i32) (result i32)
local.get $p0
local.get $p1
i32.const 1
i32.add
i32.const 2
i32.shl
i32.add
i32.load
local.get $p0
local.get $p1
i32.const 2
i32.add
i32.const 2
i32.shl
i32.add
i32.load
i32.add
)
(memory $memory (export "memory") 0)
)
And I don't think optimizeAddedConstants handles our case. It seems it handles only cases like these:
((x + 1) * 3) * 2 -> (x + 1) * 6
(5 + (i - 4)) + 7 -> i + 8
Btw, it can't handle mixed mul / shift cases like:
(x << 5) * 3
(x * 3) << 5
only this:
(x << 3) << 5 -> x << 8
Good to know those limitations. It would be good to improve optimizeAddedConstants to handle them as well.
Sure. But what about these distribution optimization rules:
(x + C1) << C2 -> (x << C2) + (C1 << C2)
(x - C1) << C2 -> (x << C2) - (C1 << C2)
(x + C1) * C2 -> (x * C2) + (C1 * C2)
(x - C1) * C2 -> (x * C2) - (C1 * C2)
as separate rules which apply carefully:
- always apply these rules to subexpressions inside load/store;
- for the remaining cases, apply only if C1 << C2 or C1 * C2 doesn't increase size when shrinkLevel != 0;
- always apply if shrinkLevel == 0.
I can make a PR. WDYT?
I'm not sure about the first and last bullet points. Why is this always better in a load/store?
But good point in the second bullet point: yes, if C1 and C2 are constants, and the result of multiplying/shifting does not increase the LEB size, then it seems worth doing.
Why is this always better in a load/store?
I should clarify. It is always better for load/store when C1 * C2 <= LowMemoryBound, because the added constant will be optimized into the offset attribute of the load/store. But it may also be useful when optimizing for speed, because wasm engines may fold a load/store plus a LEA-like expression into a single mov with complex address arithmetic.
I see. Yes, that does make sense about low memory. I'm less sure about the engine side but that may be correct as well.
I checked on SpiderMonkey:
(module
(memory $0 1)
(export "opt" (func $opt))
(export "naive" (func $naive))
(func $opt (param $ptr i32) (param $idx i32) (result i32)
(i32.load offset=16
(i32.add
(local.get $ptr)
(i32.shl
(local.get $idx)
(i32.const 2)
)
)
)
)
(func $naive (param $ptr i32) (param $idx i32) (result i32)
(i32.load
(i32.add
(local.get $ptr)
(i32.shl
(i32.add
(local.get $idx)
(i32.const 4)
)
(i32.const 2)
)
)
)
)
)
native output:
wasm-function[0]: ; opt
sub rsp, 8 ; 0x000000 48 83 ec 08
lea eax, dword ptr [rdi + rsi*4] ; 0x000004 8d 04 b7
mov eax, dword ptr [r15 + rax + 0x10] ; 0x000007 41 8b 44 07 10
nop ; 0x00000c 66 90
add rsp, 8 ; 0x00000e 48 83 c4 08
ret ; 0x000012 c3
wasm-function[1]: ; naive
sub rsp, 8 ; 0x000000 48 83 ec 08
add esi, 4 ; 0x000004 83 c6 04
lea eax, dword ptr [rdi + rsi*4] ; 0x000007 8d 04 b7
mov eax, dword ptr [r15 + rax] ; 0x00000a 41 8b 04 07
nop ; 0x00000e 66 90
add rsp, 8 ; 0x000010 48 83 c4 08
ret
I think the rest of the Wasm engines will produce a similar picture. So it is definitely worth optimizing.