Avoid quadratic behavior in `can_optimize_var_lookup`
This fixes #4938/#4923, with a different take on #4924.
I think this may introduce worst-case quadratic behavior in declare_block_predecessor instead, though I think it would happen much less often. I'd appreciate any performance insights during review.
I want to work some more on the comments before merging this, because it was hard to get right and involves subtle invariants. But it does pass the test suite now and I think it's correct.
Also, I haven't tested the performance impact of this patch yet, just its correctness. We should try both the cranelift-fuzzgen input from #4931 and @adambratschikaye's example wasm program from #4923. I can't do that until later though.
Also, we have this new benchmarking infrastructure that we're supposed to be able to trigger from a GitHub comment, right? I don't know how that works yet.
I've been running this in the fuzzer for the past 2 hours, and it hasn't complained so far.
Awesome, thank you @afonso360! I've just pushed some changes to the comments and an additional debug_assert, but I don't think that invalidates your fuzzing results.
I don't know how to explain this more clearly than with these updated comments, so I'm marking this ready for review. I still need to evaluate its performance impact though.
Running wasmtime-cli from main on @adambratschikaye's example, built with cargo build --release --no-default-features --features wasmtime/wat:
$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
1.55user 0.02system 0:01.57elapsed 100%CPU (0avgtext+0avgdata 39532maxresident)k
0inputs+24outputs (0major+13706minor)pagefaults 0swaps
Under valgrind --tool=callgrind, that executes 25,270,685,419 instructions, with about 98% of the time in use_var, and three-quarters of that spent hashing keys for the visited hash-map.
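For context, here's a minimal sketch of the kind of pattern that profile points at. This uses made-up types and is not the actual cranelift-frontend code: every lookup walks the chain of single-predecessor blocks and uses a hash-based visited set to guard against cycles, so a chain of N blocks with one lookup per block costs on the order of N² hash-set operations in total.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Block(u32);

struct Cfg {
    // `single_pred[b]` is `Some(p)` when block `b` has exactly one predecessor `p`.
    single_pred: Vec<Option<Block>>,
}

impl Cfg {
    /// Roughly the shape of the check an optimized variable lookup needs:
    /// does walking single-predecessor edges from `block` ever terminate?
    fn can_optimize_var_lookup(&self, block: Block) -> bool {
        let mut visited = HashSet::new(); // rebuilt from scratch on every lookup
        let mut current = block;
        while let Some(pred) = self.single_pred[current.0 as usize] {
            if !visited.insert(current) {
                return false; // walked back onto a block we already saw: cycle
            }
            current = pred;
        }
        true // reached a block with zero or several predecessors
    }
}

fn main() {
    // A straight-line chain of 1000 blocks: block i's only predecessor is i-1.
    let n = 1000;
    let cfg = Cfg {
        single_pred: (0..n)
            .map(|i| if i == 0 { None } else { Some(Block(i - 1)) })
            .collect(),
    };
    // One lookup per block walks the whole prefix each time: ~n^2/2 hash inserts.
    for i in 0..n {
        assert!(cfg.can_optimize_var_lookup(Block(i)));
    }
}
```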
Same test on this branch:
$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
0.39user 0.02system 0:00.41elapsed 100%CPU (0avgtext+0avgdata 39744maxresident)k
0inputs+24outputs (3major+13731minor)pagefaults 0swaps
Only 2,803,649,669 instructions according to callgrind, but about 80% of that is in the new update_predecessor_cycle function.
So this branch takes 11% as many instructions and 25% of the user time on that example.
Thanks for handling this @jameysharp!
Sure thing, @adambratschikaye; thanks for pointing out the issue!
Okay, I think it was not a coincidence that 75% of the time on main was spent in hash-table operations and my changes cut compile time by 75%. On the test case from #4923, both use_var and declare_block_predecessor are called the same number of times, and in some sense all I did was move the graph traversal from one to the other. So the entire difference is that on main this uses a HashSet, but with my PR it's indexing directly into a vector. Replacing the HashSet with a SecondaryMap, without adding any memoization, should have the same effect for this particular test case.
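To make that concrete, here's a small illustration using plain std types rather than the actual cranelift-entity/cranelift-frontend code: a HashSet has to hash the key on every membership test, whereas an entity index can address a plain vector directly, which is roughly what indexing a SecondaryMap amounts to.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Block(u32);

fn mark_visited_hash(visited: &mut HashSet<Block>, block: Block) -> bool {
    // Hashes `block` for the lookup/insert; this is the work the profile
    // above attributed three-quarters of use_var's time to.
    visited.insert(block)
}

fn mark_visited_indexed(visited: &mut Vec<bool>, block: Block) -> bool {
    // Direct indexing: no hashing, just a bounds check and a byte load/store.
    let i = block.0 as usize;
    if i >= visited.len() {
        visited.resize(i + 1, false);
    }
    let first_visit = !visited[i];
    visited[i] = true;
    first_visit
}

fn main() {
    let mut hash_set = HashSet::new();
    let mut index_set = Vec::new();
    assert!(mark_visited_hash(&mut hash_set, Block(7)));
    assert!(!mark_visited_hash(&mut hash_set, Block(7)));
    assert!(mark_visited_indexed(&mut index_set, Block(7)));
    assert!(!mark_visited_indexed(&mut index_set, Block(7)));
}
```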
I suspect real programs average more than one call to use_var per block, so I think this is still a good idea. But yesterday @cfallin suggested there might be a way to speed this check up for blocks that are sealed, so I've been playing with that. Turns out, it was a good idea.
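To sketch why sealing helps (this is an illustration with made-up types and names, not the code in this PR): once a block is sealed, its predecessor set is final, so the result of the "does the single-predecessor chain terminate?" walk can be cached and reused instead of being recomputed on every lookup. The sketch below only caches when every block it touches is sealed; a real implementation also has to cope with chains that are only partly sealed, which this sketch sidesteps and which is where the subtle invariants live.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Block(u32);

struct SsaState {
    sealed: Vec<bool>,
    single_pred: Vec<Option<Block>>, // Some(p) iff the block has exactly one predecessor, p
    chain_cache: Vec<Option<bool>>,  // cached answers, recorded only for fully sealed chains
}

impl SsaState {
    /// Does walking single-predecessor edges from `block` reach a block with
    /// zero or several predecessors (rather than looping forever)?
    fn chain_terminates(&mut self, block: Block) -> bool {
        let mut path = Vec::new();
        let mut current = block;
        let mut all_sealed = true;
        let result = loop {
            let i = current.0 as usize;
            if let Some(answer) = self.chain_cache[i] {
                break answer; // a previously cached (and therefore final) answer
            }
            all_sealed &= self.sealed[i];
            if path.contains(&current) {
                break false; // walked back onto the path: this chain is a cycle
            }
            path.push(current);
            match self.single_pred[i] {
                Some(pred) => current = pred,
                None => break true, // zero or several predecessors: the chain ends here
            }
        };
        // Only cache when every block we touched is sealed: then no block on
        // the chain can gain another predecessor later and change the answer.
        if all_sealed {
            for b in path {
                self.chain_cache[b.0 as usize] = Some(result);
            }
        }
        result
    }
}

fn main() {
    // Three sealed blocks in a chain: 2 -> 1 -> 0, where block 0 has no predecessor.
    let mut ssa = SsaState {
        sealed: vec![true; 3],
        single_pred: vec![None, Some(Block(0)), Some(Block(1))],
        chain_cache: vec![None; 3],
    };
    assert!(ssa.chain_terminates(Block(2))); // walks the chain once...
    assert!(ssa.chain_terminates(Block(1))); // ...then answers come from the cache
}
```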
When compiling our pulldown-cmark benchmark from Sightglass, this saves about 50 million instructions, or about 3% of the total, according to callgrind. So I think this is a measurable improvement to real-world performance.
As for the original test case:
$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
0.03user 0.03system 0:00.06elapsed 104%CPU (0avgtext+0avgdata 39644maxresident)k
0inputs+24outputs (0major+13739minor)pagefaults 0swaps
According to callgrind, this took 554,072,900 instructions.
Taking sealed blocks into account eliminates 98% of the compile time on this test case, both in user time and instruction count, compared with main. And at this point use_var_nonlocal plus update_predecessor_cycle take 0.29% of total runtime, so on this test there probably isn't any more performance to extract here.
Let's see if this works this time: /bench_x64
I finally ran Sightglass myself and this is a measurable improvement on real programs:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 49281270.08 ± 21356907.03 (confidence = 99%)
no-quadratic-3c0a6c74f.so is 1.09x to 1.22x faster than main-b2d13ebd4.so!
[233253012 368287922.00 470539364] main-b2d13ebd4.so
[226078864 319006651.92 393048464] no-quadratic-3c0a6c74f.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 995084992.10 ± 398195457.23 (confidence = 99%)
no-quadratic-3c0a6c74f.so is 1.07x to 1.17x faster than main-b2d13ebd4.so!
[8794947694 8985953112.82 9227174380] main-b2d13ebd4.so
[5097575234 7990868120.72 9268837322] no-quadratic-3c0a6c74f.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 7813047.40 ± 7010542.19 (confidence = 99%)
no-quadratic-3c0a6c74f.so is 1.00x to 1.09x faster than main-b2d13ebd4.so!
[161067634 176419598.60 253954244] main-b2d13ebd4.so
[122513560 168606551.20 271954282] no-quadratic-3c0a6c74f.so