binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

CSE for LEA-like operations

Open MaxGraey opened this issue 3 years ago • 11 comments

Currently binaryen can't do CSE for such cases:

i32.load(p + ((i + 0) << 2)
i32.load(p + ((i + 1) << 2)
i32.load(p + ((i + 2) << 2)

Which usually appear when we have array's access like:

p[i + 0]
p[i + 1]
p[i + 2]

I'm wondering how better to tech binaryen handle such cases and optimize it to this?

let pi = p + (i << 2)
i32.load(pi + (0 << 2))
i32.load(pi + (1 << 2))
i32.load(pi + (2 << 2))

Probably the best approach is always distribute (x op1 C1) op2 C2 to ((x op2 C2) op1 (C1 op2 C2)) in optimization instructions? WDYT?

MaxGraey avatar Jul 12 '22 12:07 MaxGraey

Probably the best approach is always distribute (x op1 C1) op2 C2 to ((x op2 C2) op1 (C1 op2 C2)) in optimization instructions?

That would increase code size, though.

I'm not sure how best to handle this. But at least the p + .. part could be handled by LocalCSE, though atm it starts from the leaves so any difference stops it. It would need to scan from the parents and allow there to be different children.

kripken avatar Jul 12 '22 16:07 kripken

That would increase code size, though.

Yes, that's may increase size for general case. But for load / store it actually may decrease size. For example:

i32.load(p + ((i + 127) << 2)

output (17 bytes):

  local.get $ptr
  local.get $idx
  i32.const 127
  i32.add
  i32.const 2
  i32.shl
  i32.add
  i32.load

Optimized:

i32.load(p + (i << 2) + (127 << 2))

output (14 bytes):

  local.get $ptr
  local.get $idx
  i32.const 2
  i32.shl
  i32.add
  i32.load offset=508

MaxGraey avatar Jul 12 '22 17:07 MaxGraey

That example ends up optimized well because one of them is a constant. We do have handling for that already,

https://github.com/WebAssembly/binaryen/blob/2ae1a61ae4d3b23e5f7feaf4c7daf0ce72ffbb92/src/passes/OptimizeInstructions.cpp#L2470-L2473

(but if you find it doesn't work on that example perhaps it has a bug).

kripken avatar Jul 13 '22 15:07 kripken

Yeah, it seems it doesn't handle properly. Here example:

(module
 (memory $0 0)
 (export "test" (func $test))
 (export "memory" (memory $0))
 (func $test (param $0 i32) (param $1 i32) (result i32)
  local.get $0
  local.get $1
  i32.const 1
  i32.add
  i32.const 2
  i32.shl
  i32.add
  i32.load
  local.get $0
  local.get $1
  i32.const 2
  i32.add
  i32.const 2
  i32.shl
  i32.add
  i32.load
  i32.add
 )
)

After: wasm-opt load_lea.wasm -O4 -o load_lea.opt.wasm got the same:

(module
  (type $t0 (func (param i32 i32) (result i32)))
  (func $test (export "test") (type $t0) (param $p0 i32) (param $p1 i32) (result i32)
    local.get $p0
    local.get $p1
    i32.const 1
    i32.add
    i32.const 2
    i32.shl
    i32.add
    i32.load
    local.get $p0
    local.get $p1
    i32.const 2
    i32.add
    i32.const 2
    i32.shl
    i32.add
    i32.load
    i32.add
  )
  (memory $memory (export "memory") 0)
)

MaxGraey avatar Jul 13 '22 15:07 MaxGraey

And I don't think optimizeAddedConstants handle our case. It seems it handle cases like that only:

((x + 1) * 3) * 2    ->    (x + 1) * 6
(5 + (i - 4)) + 7    ->     x + 8

Btw it can't handle mixed mul / lshift cases like:

(x << 5) * 3 
(x * 3) << 5 

only this:

(x << 3) << 5     ->   x << 8

MaxGraey avatar Jul 13 '22 15:07 MaxGraey

Good to know those limitations. It would be good to improve optimizeAddedConstants to handle them as well.

kripken avatar Jul 13 '22 17:07 kripken

Sure. But what about distribution optimization rules:

(x + C1) << C2    ->   (x << C2) + (C1 << C2)
(x - C1) << C2    ->   (x << C2) - (C1 << C2)
(x + C1) * C2     ->   (x * C2) + (C1 * C2)
(x - C1) * C2     ->   (x * C2) - (C1 * C2)

as separate rules which do this carefully:

  • always apply this rules for subexpressions inside load / store;
  • for rest cases apply it only if C1 << C2 or C1 * C2 doesn't increase size when shrinkLevel != 0
  • always apply if shrinkLevel == 0

I can make PR. WDYT?

MaxGraey avatar Jul 13 '22 17:07 MaxGraey

I'm not sure about the first and last bullet points. Why is this always better in a load/store?

But good point in the second bullet point: yes, if C1 and C2 are constants, and the result of multiplying/shifting does not increase the LEB size, then it seems worth doing.

kripken avatar Jul 26 '22 17:07 kripken

Why is this always better in a load/store?

I should clarify. It always better for load/store when C1 * C2 <= LowMemoryBound due to addition expr will optimized to offset attribute of load/store. But it also may be useful for optimize for speed because wasm engines may fold load/store + lea-like expressions to single mov with complex address arithmetic

MaxGraey avatar Jul 26 '22 18:07 MaxGraey

I see. Yes, that does make sense about low memory. I'm less sure about the engine side but that may be correct as well.

kripken avatar Jul 26 '22 19:07 kripken

I checked on SpiderMonkey:

(module
 (memory $0 1)
 (export "opt" (func $opt))
 (export "naive" (func $naive))
 (func $opt (param $ptr i32) (param $idx i32) (result i32)
  (i32.load offset=16
   (i32.add
    (local.get $ptr)
    (i32.shl
     (local.get $idx)
     (i32.const 2)
    )
   )
  )
 )
 (func $naive (param $ptr i32) (param $idx i32) (result i32)
  (i32.load
   (i32.add
    (local.get $ptr)
    (i32.shl
     (i32.add
      (local.get $idx)
      (i32.const 4)
     )
     (i32.const 2)
    )
   )
  )
 )
)

native output:

wasm-function[0]:  ; opt
  sub rsp, 8                            ; 0x000000 48 83 ec 08
  lea eax, dword ptr [rdi + rsi*4]      ; 0x000004 8d 04 b7
  mov eax, dword ptr [r15 + rax + 0x10] ; 0x000007 41 8b 44 07 10
  nop                                   ; 0x00000c 66 90
  add rsp, 8                            ; 0x00000e 48 83 c4 08
  ret                                   ; 0x000012 c3

wasm-function[1]:  ; naive
  sub rsp, 8                            ; 0x000000 48 83 ec 08
  add esi, 4                            ; 0x000004 83 c6 04
  lea eax, dword ptr [rdi + rsi*4]      ; 0x000007 8d 04 b7
  mov eax, dword ptr [r15 + rax]        ; 0x00000a 41 8b 04 07
  nop                                   ; 0x00000e 66 90
  add rsp, 8                            ; 0x000010 48 83 c4 08
  ret    

I think rest Wasm engines will produce a similar picture. So it is definitely worth for optimizing

MaxGraey avatar Jul 29 '22 06:07 MaxGraey