binaryen
binaryen copied to clipboard
Superoptimizer (2022)
Followup to #900 - this dusts off that code, gets it building and running, and adds some improvements.
As with #900 I'm not sure if this is worth landing, if we only run it once every few years, so the code is not polished.
I'm starting to run this now and will post results soon. Hopefully that can help prioritize work on peephole optimizations. I'll be running on some J2Wasm, Dart, and Go code that I happen to have, and which might be interesting as none of those use LLVM - that is my motivation for doing this now. In the past we only got a few % of speedup from superoptimization, but that was on LLVM output, so things may be different now.
If people have other files I should test on please let me know. Also you can run it yourself, just do wasm-analyze foo.wasm
.
f people have other files I should test on please let me know
Here's another pretty big file. This is the AssemblyScript compiler compiled by itself in release mode: as.release.wasm.zip
Findings so far:
1.
(select
(X)
(i32.const 0)
(X)
)
=>
(X)
1b.
(select
(i32.const 0)
(X)
(f32.eq
(X)
(f32.const 0)
)
)
=>
(X)
(can also happen with if)
2.
(i32.shl
(i32.shr_u
(local.get $0)
(i32.const 3)
)
(i32.const 3)
)
=->
(i32.and
(local.get $0)
(i32.const -8)
)
3.
(i32.and
(i32.add
(local.get $0)
(i32.const 8)
)
(i32.const 7)
)
=->
(i32.and
(local.get $0)
(i32.const 7)
)
4.
(i32.lt_s
(i32.shr_s
(local.get $0)
(i32.const 1)
)
(i32.const 12)
)
=->
(i32.lt_s
(local.get $0)
(i32.const 24)
)
5.
(i32.shl
(i32.sub
(i32.add
(local.get $0)
(i32.const 1)
)
(local.get $0)
)
(i32.const 2)
)
=->
(i32.const 4)
surprised we don't get that one already..? maybe a superoptimizer bug. also:
(i32.sub
(i32.add
(local.get $0)
(i32.const 1)
)
(local.get $0)
)
=->
(i32.const 1)
6.
(i32.add
(i32.mul
(local.get $0)
(i32.const 3)
)
(local.get $0)
)
=->
(i32.shl
(local.get $0)
(i32.const 2)
)
7.
(i32.or
(i32.lt_s
(local.get $0)
(i32.const 0)
)
(i32.ge_s
(local.get $0)
(i32.const 3)
)
)
=->
(i32.ge_u
(local.get $0)
(i32.const 3)
)
That's from J2Wasm and Dart. It ran very quickly on them, likely since it gives up whenever it seems a GC operation (it only optimizes numeric types). Still running on Go, which is MVP wasm so we can process more, and that will take some time.
Thanks @MaxGraey , looking at that file, it immediately hits a bug in the superoptimizer and crashes...
it immediately hits a bug in the superoptimizer and crashes...
Note we use sign extensions, bulk memory and non-trapping float conv by default. Perhaps this may cause to problem if you don't handle these ops yet
Btw you can add all found missing opportunities for peephole opt here: https://github.com/WebAssembly/binaryen/issues/1764
Ok, crash fixed. Running on as.release.wasm
, the top results are very strange and look like superoptimizer bugs (or rather, things you really need a full solver to rule out). But after them:
8.
(i32.ne
(i32.and
(local.get $0)
(i32.const 128)
)
(i32.const 128)
)
=->
(i32.eqz
(i32.and
(local.get $0)
(i32.const 128)
)
)
9.
(i32.eqz
(i32.shr_u
(local.get $0)
(i32.const 1)
)
)
=->
(i32.le_u
(local.get $0)
(i32.const 1)
)
Well, 8.
is not a bug:
export function test(x: i32): bool {
return (x & 128) != 128;
}
This optimized by Binaryen as
(func $test (param $0 i32) (result i32)
local.get $0
i32.const 128
i32.and
i32.const 128
i32.ne
)
however optimal:
(func $test (param $0 i32) (result i32)
local.get $0
i32.const 128
i32.and
i32.eqz
)
@MaxGraey
Btw you can add all found missing opportunities for peephole opt here: https://github.com/WebAssembly/binaryen/issues/1764
Good idea. I'll collect some more data, including priority, and post there. It may take a few hours or days though to get that data.
If people have other files I should test on please let me know
Here is a 10MB file from a Go project and a 20MB file from a Rust project if you want. https://cdn.jsdelivr.net/npm/[email protected]/esbuild.wasm https://cdn.jsdelivr.net/npm/@swc/[email protected]/wasm_bg.wasm
Nice !
It would be interesting to see what it can find here https://github.com/sola-st/WasmBench The dataset is here, https://github.com/sola-st/WasmBench/releases/download/v1.0/filtered-binaries-metadata.7z, ~8000 real Wasm binaries collected in the wild.
Also, this might be helpful, https://github.com/bytecodealliance/wasm-tools/tree/main/crates/wasm-smith You can artificially generate some valid Wasm inputs and test the new feature.
Thanks @merceyz and @Jacarte ! I'll test on those too.
After the last fixes this can handle a lot more IR, but that also makes it a lot slower. It's taking several hours on a large file... so this might take a while.
You can artificially generate some valid Wasm inputs and test the new feature.
"Fuzz" inputs like that is an interesting idea too. Aside from wasm-smith
there is also the Binaryen fuzzer in this project which can do similar things (generate valid wasm files from random inputs). My worry though is that fuzz inputs might happen to find useful patterns that don't occur in the wild, and right now the superoptimizer treats all the input the same - as both things to optimize, and patterns to optimize against. Perhaps refactoring that (so fuzz inputs are only the latter) would help there.
I found a way to make the superoptimizer fast enough to run on very large inputs, so I ran it yesterday on the combined wasm from several large real-world sources, including LLVM (C++ and Rust), Java, Go, Dart, and AssemblyScript. Doing it all at once maximizes the opportunity to find that one code generator does something clever that another one is missing. I filed issues with the top results, including the superoptimizer's best guess at their priority. Here is the list, starting from the most important:
- #5004
- #5005
- #5006
- #5007
- #5008
- #5009
- #5010
- #5011
- #5012
This method of superoptimizing on everything at once has the limitation that it can't tell which of the inputs would benefit from each new optimization, since it doesn't track the origin before it mixes all the wasms together and superoptimizes. So there is no way to tell which of the code generators each optimization is "for", which could have been interesting... but anyhow, it's worth implementing all of those, I think.
After implementing some of the top suggestions here, the next PR shows fairly low benefits when I test it, https://github.com/WebAssembly/binaryen/pull/5069 . It's possible that means the ones after it also have lower benefit, and are lower priority.