CSE for LEA-like operations
Currently binaryen can't do CSE for cases like these:
i32.load(p + ((i + 0) << 2))
i32.load(p + ((i + 1) << 2))
i32.load(p + ((i + 2) << 2))
These usually appear when we have array accesses like:
p[i + 0]
p[i + 1]
p[i + 2]
I'm wondering how best to teach binaryen to handle such cases and optimize them to this:
let pi = p + (i << 2)
i32.load(pi + (0 << 2))
i32.load(pi + (1 << 2))
i32.load(pi + (2 << 2))
Probably the best approach is to always distribute (x op1 C1) op2 C2 to ((x op2 C2) op1 (C1 op2 C2)) in OptimizeInstructions? WDYT?
Probably the best approach is to always distribute (x op1 C1) op2 C2 to ((x op2 C2) op1 (C1 op2 C2)) in OptimizeInstructions?
That would increase code size, though.
I'm not sure how best to handle this. But at least the p + .. part could be handled by LocalCSE, though at the moment it starts from the leaves, so any difference stops it. It would need to scan from the parents and allow the children to differ.
That would increase code size, though.
Yes, that may increase size in the general case. But for load/store it may actually decrease size. For example:
i32.load(p + ((i + 127) << 2))
output (17 bytes):
local.get $ptr
local.get $idx
i32.const 127
i32.add
i32.const 2
i32.shl
i32.add
i32.load
Optimized:
i32.load(p + (i << 2) + (127 << 2))
output (14 bytes):
local.get $ptr
local.get $idx
i32.const 2
i32.shl
i32.add
i32.load offset=508
That example ends up optimized well because one of the addends is a constant. We do have handling for that already,
https://github.com/WebAssembly/binaryen/blob/2ae1a61ae4d3b23e5f7feaf4c7daf0ce72ffbb92/src/passes/OptimizeInstructions.cpp#L2470-L2473
(but if you find it doesn't work on that example perhaps it has a bug).
Yeah, it seems it doesn't handle this properly. Here's an example:
(module
(memory $0 0)
(export "test" (func $test))
(export "memory" (memory $0))
(func $test (param $0 i32) (param $1 i32) (result i32)
local.get $0
local.get $1
i32.const 1
i32.add
i32.const 2
i32.shl
i32.add
i32.load
local.get $0
local.get $1
i32.const 2
i32.add
i32.const 2
i32.shl
i32.add
i32.load
i32.add
)
)
After wasm-opt load_lea.wasm -O4 -o load_lea.opt.wasm I got the same output:
(module
(type $t0 (func (param i32 i32) (result i32)))
(func $test (export "test") (type $t0) (param $p0 i32) (param $p1 i32) (result i32)
local.get $p0
local.get $p1
i32.const 1
i32.add
i32.const 2
i32.shl
i32.add
i32.load
local.get $p0
local.get $p1
i32.const 2
i32.add
i32.const 2
i32.shl
i32.add
i32.load
i32.add
)
(memory $memory (export "memory") 0)
)
And I don't think optimizeAddedConstants handles our case. It seems it handles only cases like these:
((x + 1) * 3) * 2 -> (x + 1) * 6
(5 + (i - 4)) + 7 -> i + 8
Btw, it can't handle mixed mul / shift cases like:
(x << 5) * 3
(x * 3) << 5
only this:
(x << 3) << 5 -> x << 8
Good to know those limitations. It would be good to improve optimizeAddedConstants to handle them as well.
Sure. But what about these distribution optimization rules:
(x + C1) << C2 -> (x << C2) + (C1 << C2)
(x - C1) << C2 -> (x << C2) - (C1 << C2)
(x + C1) * C2 -> (x * C2) + (C1 * C2)
(x - C1) * C2 -> (x * C2) - (C1 * C2)
as separate rules which apply carefully:
- always apply these rules to subexpressions inside load/store;
- for the remaining cases, apply only if C1 << C2 or C1 * C2 doesn't increase size when shrinkLevel != 0;
- always apply if shrinkLevel == 0.
I can make a PR. WDYT?
I'm not sure about the first and last bullet points. Why is this always better in a load/store?
But good point in the second bullet point: yes, if C1 and C2 are constants, and the result of multiplying/shifting does not increase the LEB size, then it seems worth doing.
Why is this always better in a load/store?
I should clarify. It is always better for load/store when C1 * C2 <= LowMemoryBound, because the added constant will be optimized into the offset attribute of the load/store. But it may also be useful when optimizing for speed, because wasm engines may fold a load/store plus a LEA-like expression into a single mov with complex address arithmetic.
I see. Yes, that does make sense about low memory. I'm less sure about the engine side but that may be correct as well.
I checked on SpiderMonkey:
(module
(memory $0 1)
(export "opt" (func $opt))
(export "naive" (func $naive))
(func $opt (param $ptr i32) (param $idx i32) (result i32)
(i32.load offset=16
(i32.add
(local.get $ptr)
(i32.shl
(local.get $idx)
(i32.const 2)
)
)
)
)
(func $naive (param $ptr i32) (param $idx i32) (result i32)
(i32.load
(i32.add
(local.get $ptr)
(i32.shl
(i32.add
(local.get $idx)
(i32.const 4)
)
(i32.const 2)
)
)
)
)
)
native output:
wasm-function[0]: ; opt
sub rsp, 8 ; 0x000000 48 83 ec 08
lea eax, dword ptr [rdi + rsi*4] ; 0x000004 8d 04 b7
mov eax, dword ptr [r15 + rax + 0x10] ; 0x000007 41 8b 44 07 10
nop ; 0x00000c 66 90
add rsp, 8 ; 0x00000e 48 83 c4 08
ret ; 0x000012 c3
wasm-function[1]: ; naive
sub rsp, 8 ; 0x000000 48 83 ec 08
add esi, 4 ; 0x000004 83 c6 04
lea eax, dword ptr [rdi + rsi*4] ; 0x000007 8d 04 b7
mov eax, dword ptr [r15 + rax] ; 0x00000a 41 8b 04 07
nop ; 0x00000e 66 90
add rsp, 8 ; 0x000010 48 83 c4 08
ret
I think the rest of the Wasm engines will produce a similar picture. So it is definitely worth optimizing.