wasmtime icon indicating copy to clipboard operation
wasmtime copied to clipboard

Avoid quadratic behavior in `can_optimize_var_lookup`

Open jameysharp opened this issue 2 years ago • 4 comments

This fixes #4938/#4923, with a different take on #4924.

I think this maybe introduces worst-case quadratic behavior in declare_block_predecessor instead, though I think it would happen much less often. I'd appreciate any performance insights during review.

I want to work some more on the comments before merging this, because it was hard to get right and involves subtle invariants. But it does pass the test suite now and I think it's correct.

jameysharp avatar Sep 21 '22 18:09 jameysharp

Also, I haven't tested the performance impact of this patch yet, just its correctness. We should try both the cranelift-fuzzgen input from #4931 and @adambratschikaye's example wasm program from #4923. I can't do that until later though.

Also, we have this new benchmarking infrastructure that we're supposed to be able to trigger from a GitHub comment, right? I don't know how that works yet.

jameysharp avatar Sep 21 '22 18:09 jameysharp

I've been running this in the fuzzer for the past 2 hours, and it hasn't complained so far.

afonso360 avatar Sep 21 '22 21:09 afonso360

Awesome, thank you @afonso360! I've just pushed some changes to the comments and an additional debug_assert but I don't think that invalidates your fuzzing results.

I don't know how to explain this more clearly than with these updated comments, so I'm marking this ready for review. I still need to evaluate its performance impact though.

jameysharp avatar Sep 21 '22 21:09 jameysharp

Running wasmtime-cli from main on @adambratschikaye's example with cargo build --release --no-default-features --features wasmtime/wat:

$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
1.55user 0.02system 0:01.57elapsed 100%CPU (0avgtext+0avgdata 39532maxresident)k
0inputs+24outputs (0major+13706minor)pagefaults 0swaps

Under valgrind --tool=callgrind that executes 25,270,685,419 instructions, with about 98% of the time in use_var, and three-quarters of that hashing keys for the visited hash-map.

Same test on this branch:

$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
0.39user 0.02system 0:00.41elapsed 100%CPU (0avgtext+0avgdata 39744maxresident)k
0inputs+24outputs (3major+13731minor)pagefaults 0swaps

Only 2,803,649,669 instructions according to callgrind, but about 80% of that is in the new update_predecessor_cycle function.

So this branch takes 11% as many instructions and 25% of the user time on that example.

jameysharp avatar Sep 21 '22 22:09 jameysharp

Thanks for handling this @jameysharp!

adambratschikaye avatar Sep 22 '22 11:09 adambratschikaye

Sure thing, @adambratschikaye; thanks for pointing out the issue!

Okay, I think it was not a coincidence that 75% of the time on main was spent in hash-table operations, and my changes cut compile time by 75%. On the test case from #4923, both use_var and declare_block_predecessor are called the same number of times, and in some sense all I did was move the graph traversal from one to the other. So the entire difference is that on main this uses a HashSet, but with my PR it's indexing directly into a vector. Replacing the HashSet with a SecondaryMap, without adding any memoization, should have the same effect for this particular test case.

I suspect real programs average more than one call to use_var per block, so I think this is still a good idea, but yesterday @cfallin suggested there might be a way to speed this check up for blocks that are sealed, so I've been playing with that. Turns out, it was a good idea.

When compiling our pulldown-cmark benchmark from Sightglass, this saves about 50 million instructions, or about 3% of the total, according to callgrind. So I think this is a measurable improvement to real-world performance.

As for the original test case:

$ /usr/bin/time target/release/wasmtime compile --disable-cache /tmp/4923.wat
0.03user 0.03system 0:00.06elapsed 104%CPU (0avgtext+0avgdata 39644maxresident)k
0inputs+24outputs (0major+13739minor)pagefaults 0swaps

According to callgrind, this took 554,072,900 instructions.

Taking sealed blocks into account eliminates 98% of the compile time on this test case, both in terms of user time and instruction count, compared with main. And at this point use_var_nonlocal plus update_predecessor_cycle take 0.29% of total runtime, so on this test there probably isn't any more performance to extract here.

Let's see if this works this time: /bench_x64

jameysharp avatar Sep 22 '22 22:09 jameysharp

I finally ran Sightglass myself and this is a measurable improvement on real programs:

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 49281270.08 ± 21356907.03 (confidence = 99%)

  no-quadratic-3c0a6c74f.so is 1.09x to 1.22x faster than main-b2d13ebd4.so!

  [233253012 368287922.00 470539364] main-b2d13ebd4.so
  [226078864 319006651.92 393048464] no-quadratic-3c0a6c74f.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 995084992.10 ± 398195457.23 (confidence = 99%)

  no-quadratic-3c0a6c74f.so is 1.07x to 1.17x faster than main-b2d13ebd4.so!

  [8794947694 8985953112.82 9227174380] main-b2d13ebd4.so
  [5097575234 7990868120.72 9268837322] no-quadratic-3c0a6c74f.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 7813047.40 ± 7010542.19 (confidence = 99%)

  no-quadratic-3c0a6c74f.so is 1.00x to 1.09x faster than main-b2d13ebd4.so!

  [161067634 176419598.60 253954244] main-b2d13ebd4.so
  [122513560 168606551.20 271954282] no-quadratic-3c0a6c74f.so

jameysharp avatar Sep 23 '22 16:09 jameysharp