wasmtime icon indicating copy to clipboard operation
wasmtime copied to clipboard

Changing an immediate operand from 1 to 0 makes suspicion performance decreasing

Open hungryzzz opened this issue 9 months ago • 1 comments

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 3.07s
  • WasmEdge: 1.18s

For bad.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 4.68s
  • WasmEdge: 1.18s

The difference between the attached two cases is as follow, i.e., changing the value which will be store to address 0 from 1 to 0 outsize the loop, which can bring 1.6s performance decreasing on Wasmtime but has no effect on WasmEdge. Moreover, the address 0 will be then stored the same value 0, so the first storage will not affect the loop in theory.

➜  cases diff good.wat bad.wat
23c23
<     i32.const 1
---
>     i32.const 0
;; part of `good.wat`
i32.const 0
i32.const 1  ;; here
i32.store
i32.const 0
i32.const 0
i32.store

;; part of `bad.wat`
i32.const 0
i32.const 0 ;; here
i32.store
i32.const 0
i32.const 0
i32.store

And I check the machine code generated by Wasmtime and WasmEdge, and find the instruction sequence of Wasmtime is different before and after change but the WasmEdge one only changes an immediate of an instruction from 1 to 0.

I have two confusions to the machine code generatd by Wasmtime:

  1. Why the instruction sequence are different between two cases? In my view, maybe only changing the immediate of an instruction just as WasmEdge is enough.
  2. Why the difference of the two instruction sequences which are both out of the loop, can cause 1.6s performance difference?
➜  binary-diff diff good.wasmedge.asm bad.wasmedge.asm
45c45
<   82:	c7 02 01 00 00 00    	movl   $0x1,(%rdx)
---
>   82:	c7 02 00 00 00 00    	movl   $0x0,(%rdx)


# difference of machine code generated by wasmtime
# good.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	41 bb 01 00 00 00    	mov    $0x1,%r11d
44:	c7 01 01 00 00 00    	movl   $0x1,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
56:	44 88 99 4c 01 00 00 	mov    %r11b,0x14c(%rcx)
5d:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

# bad.wasm
29:	44 88 11                mov    %r10b,(%rcx)
2c:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
32:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
38:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
3e:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
44:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
4a:	c7 01 00 00 00 00    	movl   $0x0,(%rcx)
50:	b8 01 00 00 00       	mov    $0x1,%eax
55:	88 81 4c 01 00 00    	mov    %al,0x14c(%rcx)
5b:	44 88 91 bc 01 00 00 	mov    %r10b,0x1bc(%rcx)

Profiling Information

I use Perf tool to profile the hotspot and the results are as follows:

# good.wasm
       │      xor    %r10d,%r10d
  0.01 │      mov    %r10b,(%rcx)
  0.01 │      movl   $0x0,(%rcx)
 20.82 │      movl   $0x0,(%rcx)
  0.05 │      movl   $0x0,(%rcx)
 18.60 │      mov    $0x1,%r11d
  0.03 │      movl   $0x1,(%rcx)
 20.42 │      movl   $0x0,(%rcx)
 22.37 │      movl   $0x0,(%rcx)
 17.68 │      mov    %r11b,0x14c(%rcx)
  0.01 │      mov    %r10b,0x1bc(%rcx)

# bad.wasm
       │      xor    %r10d,%r10d
       │      mov    %r10b,(%rcx)
 24.94 │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
       │      movl   $0x0,(%rcx)
 75.04 │      movl   $0x0,(%rcx)
  0.01 │      mov    $0x1,%eax
       │      mov    %al,0x14c(%rcx)
  0.01 │      mov    %r10b,0x1bc(%rcx)
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

  • Wasmtime version or commit: 91ec9a589cc6c7f031ef4c
  • Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
  • Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz

hungryzzz avatar May 17 '24 16:05 hungryzzz