binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

Superoptimizer (2022)

Open kripken opened this issue 1 year ago • 13 comments

Followup to #900 - this dusts off that code, gets it building and running, and adds some improvements.

As with #900 I'm not sure if this is worth landing, if we only run it once every few years, so the code is not polished.

I'm starting to run this now and will post results soon. Hopefully that can help prioritize work on peephole optimizations. I'll be running on some J2Wasm, Dart, and Go code that I happen to have, and which might be interesting as none of those use LLVM - that is my motivation for doing this now. In the past we only got a few % of speedup from superoptimization, but that was on LLVM output, so things may be different now.

If people have other files I should test on please let me know. Also you can run it yourself, just do wasm-analyze foo.wasm.

kripken avatar Aug 30 '22 20:08 kripken

f people have other files I should test on please let me know

Here's another pretty big file. This is the AssemblyScript compiler compiled by itself in release mode: as.release.wasm.zip

MaxGraey avatar Aug 30 '22 20:08 MaxGraey

Findings so far:

1.

(select
  (X)
  (i32.const 0)
  (X)
)
=>
(X)

1b.

(select
  (i32.const 0)
  (X)
  (f32.eq
    (X)
    (f32.const 0)
  )
)
=>
(X)

(can also happen with if)

2.

(i32.shl
 (i32.shr_u
  (local.get $0)
  (i32.const 3)
 )
 (i32.const 3)
)
 =->
(i32.and
 (local.get $0)
 (i32.const -8)
)

3.

(i32.and
 (i32.add
  (local.get $0)
  (i32.const 8)
 )
 (i32.const 7)
)
 =->
(i32.and
 (local.get $0)
 (i32.const 7)
)

4.

(i32.lt_s
 (i32.shr_s
  (local.get $0)
  (i32.const 1)
 )
 (i32.const 12)
)
 =->
(i32.lt_s
 (local.get $0)
 (i32.const 24)
)

5.

(i32.shl
 (i32.sub
  (i32.add
   (local.get $0)
   (i32.const 1)
  )
  (local.get $0)
 )
 (i32.const 2)
)
 =->
(i32.const 4)

surprised we don't get that one already..? maybe a superoptimizer bug. also:

(i32.sub
 (i32.add
  (local.get $0)
  (i32.const 1)
 )
 (local.get $0)
)
 =->
(i32.const 1)

6.

(i32.add
 (i32.mul
  (local.get $0)
  (i32.const 3)
 )
 (local.get $0)
)
 =->
(i32.shl
 (local.get $0)
 (i32.const 2)
)

7.

(i32.or
 (i32.lt_s
  (local.get $0)
  (i32.const 0)
 )
 (i32.ge_s
  (local.get $0)
  (i32.const 3)
 )
)
 =->
(i32.ge_u
 (local.get $0)
 (i32.const 3)
)

That's from J2Wasm and Dart. It ran very quickly on them, likely since it gives up whenever it seems a GC operation (it only optimizes numeric types). Still running on Go, which is MVP wasm so we can process more, and that will take some time.

kripken avatar Aug 30 '22 21:08 kripken

Thanks @MaxGraey , looking at that file, it immediately hits a bug in the superoptimizer and crashes...

kripken avatar Aug 30 '22 21:08 kripken

it immediately hits a bug in the superoptimizer and crashes...

Note we use sign extensions, bulk memory and non-trapping float conv by default. Perhaps this may cause to problem if you don't handle these ops yet

MaxGraey avatar Aug 30 '22 21:08 MaxGraey

Btw you can add all found missing opportunities for peephole opt here: https://github.com/WebAssembly/binaryen/issues/1764

MaxGraey avatar Aug 30 '22 21:08 MaxGraey

Ok, crash fixed. Running on as.release.wasm, the top results are very strange and look like superoptimizer bugs (or rather, things you really need a full solver to rule out). But after them:

8.

(i32.ne
 (i32.and
  (local.get $0)
  (i32.const 128)
 )
 (i32.const 128)
)
 =->
(i32.eqz
 (i32.and
  (local.get $0)
  (i32.const 128)
 )
)

9.

(i32.eqz
 (i32.shr_u
  (local.get $0)
  (i32.const 1)
 )
)
 =->
(i32.le_u
 (local.get $0)
 (i32.const 1)
)

kripken avatar Aug 30 '22 21:08 kripken

Well, 8. is not a bug:

export function test(x: i32): bool {
  return (x & 128) != 128;
}

This optimized by Binaryen as

(func $test (param $0 i32) (result i32)
  local.get $0
  i32.const 128
  i32.and
  i32.const 128
  i32.ne
)

however optimal:

(func $test (param $0 i32) (result i32)
  local.get $0
  i32.const 128
  i32.and 
  i32.eqz
)

MaxGraey avatar Aug 30 '22 21:08 MaxGraey

@MaxGraey

Btw you can add all found missing opportunities for peephole opt here: https://github.com/WebAssembly/binaryen/issues/1764

Good idea. I'll collect some more data, including priority, and post there. It may take a few hours or days though to get that data.

kripken avatar Aug 31 '22 00:08 kripken

If people have other files I should test on please let me know

Here is a 10MB file from a Go project and a 20MB file from a Rust project if you want. https://cdn.jsdelivr.net/npm/[email protected]/esbuild.wasm https://cdn.jsdelivr.net/npm/@swc/[email protected]/wasm_bg.wasm

merceyz avatar Aug 31 '22 11:08 merceyz

Nice !

It would be interesting to see what it can find here https://github.com/sola-st/WasmBench The dataset is here, https://github.com/sola-st/WasmBench/releases/download/v1.0/filtered-binaries-metadata.7z, ~8000 real Wasm binaries collected in the wild.

Jacarte avatar Aug 31 '22 15:08 Jacarte

Also, this might be helpful, https://github.com/bytecodealliance/wasm-tools/tree/main/crates/wasm-smith You can artificially generate some valid Wasm inputs and test the new feature.

Jacarte avatar Aug 31 '22 15:08 Jacarte

Thanks @merceyz and @Jacarte ! I'll test on those too.

After the last fixes this can handle a lot more IR, but that also makes it a lot slower. It's taking several hours on a large file... so this might take a while.

You can artificially generate some valid Wasm inputs and test the new feature.

"Fuzz" inputs like that is an interesting idea too. Aside from wasm-smith there is also the Binaryen fuzzer in this project which can do similar things (generate valid wasm files from random inputs). My worry though is that fuzz inputs might happen to find useful patterns that don't occur in the wild, and right now the superoptimizer treats all the input the same - as both things to optimize, and patterns to optimize against. Perhaps refactoring that (so fuzz inputs are only the latter) would help there.

kripken avatar Aug 31 '22 15:08 kripken

I found a way to make the superoptimizer fast enough to run on very large inputs, so I ran it yesterday on the combined wasm from several large real-world sources, including LLVM (C++ and Rust), Java, Go, Dart, and AssemblyScript. Doing it all at once maximizes the opportunity to find that one code generator does something clever that another one is missing. I filed issues with the top results, including the superoptimizer's best guess at their priority. Here is the list, starting from the most important:

  • #5004
  • #5005
  • #5006
  • #5007
  • #5008
  • #5009
  • #5010
  • #5011
  • #5012

This method of superoptimizing on everything at once has the limitation that it can't tell which of the inputs would benefit from each new optimization, since it doesn't track the origin before it mixes all the wasms together and superoptimizes. So there is no way to tell which of the code generators each optimization is "for", which could have been interesting... but anyhow, it's worth implementing all of those, I think.

kripken avatar Sep 02 '22 20:09 kripken

After implementing some of the top suggestions here, the next PR shows fairly low benefits when I test it, https://github.com/WebAssembly/binaryen/pull/5069 . It's possible that means the ones after it also have lower benefit, and are lower priority.

kripken avatar Sep 22 '22 22:09 kripken