Performance regression using new interpreter API
We have been using the WABT 1.0.12 interpreter as a baseline for our benchmarking while developing Fizzy.
Recently we compared WABT 1.0.12 against 1.0.19 on our benchmarks (mostly arithmetic-heavy code, e.g. crypto algorithms), and the latter showed a significant slowdown: execution is 2-3x slower in many benchmarks, and instantiation is also slower, though only by up to ~15%.
We don't rule out the possibility that we are using the new interpreter API in some suboptimal way, so we would appreciate it if any obvious issues with it were pointed out.
Our code is in https://github.com/wasmx/fizzy/blob/ce3c446e62acce5db245a1287b7a524956277b87/test/utils/wabt_engine.cpp (Note: we create a single Thread object and reuse it across all execute iterations; with the old API we similarly created a single Executor.)
Detailed benchmarking results are in https://github.com/wasmx/fizzy/pull/443.
Thank you.
Note that the interpreter has been written for correctness and clarity, not performance. It is quite possible that as new features get added, it will keep getting slower. The only way to prevent that would be to emit specialized opcodes for every conditional being introduced, which would very much go against the primary goals.
I know for example that with my recent addition of Memory64 support, extra conditionals were added to every memory access, and lookups to see which memory type we're dealing with. I certainly hope that alone wasn't a 2x slowdown :)
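To make that kind of conditional concrete, here is a generic sketch (hypothetical names, not WABT's actual implementation): a generic load handler must branch on the memory type on every access, while a specialized opcode resolves that once, when the instruction is emitted.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch, not WABT's actual code: the cost of deciding the
// memory type at run time versus at opcode-emission time.
struct Memory {
  std::vector<uint8_t> data;
  bool is64;  // Memory64: 64-bit address space
};

// Generic handler: branches on the memory type at every single access.
uint32_t LoadGeneric(const Memory& mem, uint64_t addr) {
  if (mem.is64) {
    assert(addr + 4 <= mem.data.size());  // 64-bit bounds-check path
  } else {
    assert(static_cast<uint32_t>(addr) + 4ull <= mem.data.size());  // 32-bit path
  }
  uint32_t v;
  std::memcpy(&v, mem.data.data() + addr, sizeof(v));
  return v;
}

// Specialized handler: the memory type was resolved when the opcode was
// emitted, so the hot path avoids the branch and the type lookup.
uint32_t Load32(const Memory& mem, uint32_t addr) {
  assert(addr + 4ull <= mem.data.size());
  uint32_t v;
  std::memcpy(&v, mem.data.data() + addr, sizeof(v));
  return v;
}
```

Emitting `Load32`/`Load64` variants removes the per-access branch, but doubles the number of opcode handlers to maintain, which is exactly the complexity trade-off mentioned above.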
Maybe you could run your benchmark with various recent commits to isolate which one added the most slowdown?
I profiled wasm-interp and the issue is in "New interpreter (#1330)", introduced in 1.0.14.
I could not use our benchmarking tool and a bigger set of inputs because the API changed a lot between 1.0.12 and 1.0.14 due to the implementation of the Wasm C API.
I ended up using a single input, memset.wasm (memset.wat).
Time before: 0.003623 s. Time after: 0.009478 s (2.61x slower).
I was not looking into the reasons for the slowdown, but I noticed that the BinOp function is not inlined in the new interpreter.
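One generic way to keep binary-op handling inlinable (a hypothetical sketch, not WABT's actual code) is to route it through a small template, so each case in the dispatch switch gets its own fully inlined body instead of a call into a shared out-of-line BinOp:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not WABT's code: a templated BinOp helper that the
// compiler can inline separately into each switch case.
enum class Op { I32Add, I32Mul };

template <typename F>
inline uint32_t BinOp(uint32_t* sp, F op) {
  // sp points one past the top of the value stack; pop two, push the result.
  uint32_t rhs = *--sp;
  uint32_t lhs = *--sp;
  *sp++ = op(lhs, rhs);
  return *(sp - 1);
}

uint32_t Step(Op op, uint32_t* sp) {
  switch (op) {
    case Op::I32Add:
      return BinOp(sp, [](uint32_t a, uint32_t b) { return a + b; });
    case Op::I32Mul:
      return BinOp(sp, [](uint32_t a, uint32_t b) { return a * b; });
  }
  return 0;  // unreachable
}
```

If the helper instead lives out of line in a .cc file, every arithmetic instruction pays a function call and the operation cannot be folded into the dispatch loop, which is consistent with the instruction-count increase in the perf stats below.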
Perf stats:

```
> perf stat -e cycles,instructions ./wasm-interp memset.wasm --run-all-exports
memset_bench() => i32:5100000

 Performance counter stats for './wasm-interp memset.wasm --run-all-exports':

        15 041 811      cycles
        39 211 115      instructions    # 2,61 insn per cycle

       0,003593214 seconds time elapsed

       0,003623000 seconds user
       0,000000000 seconds sys
```

```
> perf stat -e cycles,instructions ./wasm-interp-new memset.wasm --run-all-exports
memset_bench() => i32:5100000

 Performance counter stats for './wasm-interp-new memset.wasm --run-all-exports':

        39 301 171      cycles
       105 862 465      instructions    # 2,69 insn per cycle

       0,009431269 seconds time elapsed

       0,009478000 seconds user
       0,000000000 seconds sys
```
@binji any idea what part of that could be the biggest slowdown?
Would be good to run it under a profiler to see if there's any low-hanging fruit.
IIRC there were some hacks in the old interpreter to try to keep more values in local variables while running the interpreter loop. The new interpreter was refactored to use a "Step" function instead; perhaps that's not being inlined/optimized. I'd guess there are a lot of opportunities to speed it up, but I'd worry a little about adding complexity.