wasmtime
wasmtime copied to clipboard
Changing an immediate operand from 1 to 0 makes suspicion performance decreasing
Test Cases
Steps to Reproduce
Hi, I run the attached two cases(good.wasm
&bad.wasm
) in Wasmtime
and WasmEdge
(AOT), and collect their execution time respectively (measured by time
tool).
# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm
# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm
Expected Results & Actual Results
For good.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 3.07s
- WasmEdge: 1.18s
For bad.wasm
, the execution time in different runtimes are as follows:
- Wasmtime: 4.68s
- WasmEdge: 1.18s
The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1
to 0
outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.
➜ cases diff good.wat bad.wat
23c23
< i32.const 1
---
> i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1 ;; here
i32.store
i32.const 0
i32.const 0
i32.store
;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store
And I check the machine code generated by Wasmtime
and WasmEdge
, and find the instruction sequence of Wasmtime
is different before and after change but the WasmEdge
one only changes an immediate of an instruction from 1
to 0
.
I have two confusions to the machine code generatd by Wasmtime
:
- Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as
WasmEdge
is enough. - Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?
➜ binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
< 82: c7 02 01 00 00 00 movl $0x1,(%rdx)
---
> 82: c7 02 00 00 00 00 movl $0x0,(%rdx)
# difference of machine code generated by wasmtime
# good.wasm
29: 44 88 11 mov %r10b,(%rcx)
2c: c7 01 00 00 00 00 movl $0x0,(%rcx)
32: c7 01 00 00 00 00 movl $0x0,(%rcx)
38: c7 01 00 00 00 00 movl $0x0,(%rcx)
3e: 41 bb 01 00 00 00 mov $0x1,%r11d
44: c7 01 01 00 00 00 movl $0x1,(%rcx)
4a: c7 01 00 00 00 00 movl $0x0,(%rcx)
50: c7 01 00 00 00 00 movl $0x0,(%rcx)
56: 44 88 99 4c 01 00 00 mov %r11b,0x14c(%rcx)
5d: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
# bad.wasm
29: 44 88 11 mov %r10b,(%rcx)
2c: c7 01 00 00 00 00 movl $0x0,(%rcx)
32: c7 01 00 00 00 00 movl $0x0,(%rcx)
38: c7 01 00 00 00 00 movl $0x0,(%rcx)
3e: c7 01 00 00 00 00 movl $0x0,(%rcx)
44: c7 01 00 00 00 00 movl $0x0,(%rcx)
4a: c7 01 00 00 00 00 movl $0x0,(%rcx)
50: b8 01 00 00 00 mov $0x1,%eax
55: 88 81 4c 01 00 00 mov %al,0x14c(%rcx)
5b: 44 88 91 bc 01 00 00 mov %r10b,0x1bc(%rcx)
Profiling Information
I use Perf tool to profile the hotspot and the results are as follows:
# good.wasm
│ xor %r10d,%r10d
0.01 │ mov %r10b,(%rcx)
0.01 │ movl $0x0,(%rcx)
20.82 │ movl $0x0,(%rcx)
0.05 │ movl $0x0,(%rcx)
18.60 │ mov $0x1,%r11d
0.03 │ movl $0x1,(%rcx)
20.42 │ movl $0x0,(%rcx)
22.37 │ movl $0x0,(%rcx)
17.68 │ mov %r11b,0x14c(%rcx)
0.01 │ mov %r10b,0x1bc(%rcx)
# bad.wasm
│ xor %r10d,%r10d
│ mov %r10b,(%rcx)
24.94 │ movl $0x0,(%rcx)
│ movl $0x0,(%rcx)
│ movl $0x0,(%rcx)
│ movl $0x0,(%rcx)
│ movl $0x0,(%rcx)
75.04 │ movl $0x0,(%rcx)
0.01 │ mov $0x1,%eax
│ mov %al,0x14c(%rcx)
0.01 │ mov %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children
Versions and Environment
- Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz