wasm-micro-runtime Performance regression when i32.load and i32.store target the same address

Subject of the issue

Hello, when i32.load and i32.store run in a loop and both target the same address, I observe a significant slowdown in Wasmer (Cranelift backend) and WAMR Fast JIT. The other configurations (Wasmer and WAMR with their LLVM backends, Wasmtime, WasmEdge JIT) do not show this behavior.

The timing data is as follows:

Runtime	Same address (s)	Different addresses (s)
wasmer_llvm	1.262783	1.326196
wasmedge_jit	1.19138	0.654879
wamr_llvm_jit	1.199998	1.186887
wasmer_cranelift	8.269836	1.198884
wasmtime	1.249896	1.397831
wamr_fast_jit	8.42154	3.590291

All times are in seconds; each value is the average of ten runs.

When the load and store use the same address, Wasmer (Cranelift) and WAMR Fast JIT take significantly longer than the other runtimes. When they use different addresses, the gap disappears for Wasmer (Cranelift) and narrows considerably for WAMR Fast JIT.
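
To put the gap in numbers, the slowdown ratio (same-address time divided by different-address time) can be computed from the table above. This is a minimal sketch using awk; the function name `slowdown` is only illustrative, and the timings are copied from the table.

```shell
#!/bin/sh
# Hypothetical helper: for each input row "<runtime> <same> <different>",
# print "<runtime> <same/different ratio>" with two decimals.
slowdown() {
  awk '{ printf "%s %.2f\n", $1, $2 / $3 }'
}

# Timings in seconds, copied from the table above.
slowdown <<'EOF'
wasmer_llvm 1.262783 1.326196
wasmedge_jit 1.19138 0.654879
wamr_llvm_jit 1.199998 1.186887
wasmer_cranelift 8.269836 1.198884
wasmtime 1.249896 1.397831
wamr_fast_jit 8.42154 3.590291
EOF
```

On this data the ratio is roughly 6.9 for wasmer_cranelift and roughly 2.3 for wamr_fast_jit. WasmEdge JIT also comes out at about 1.8, but only because its different-address run is faster than everyone else's; its same-address time is in line with the other runtimes.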

Test case

The minimal reproducer for the same-address case is as follows:

test_case.wat

(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;1;) (type 1))
  (func (;2;) (type 1)
    (local i32 i32)
    (local.set 1
      (i32.const 0))
    (loop

      ;;read-modify-write
      (i32.store
        (i32.const 1040)
        (i32.add
          (i32.load
            (i32.const 1040))
          (i32.const 1)))
          
      (local.set 1
        (i32.add
          (local.get 1)
          (i32.const 1)))
      (br_if 0 (;@3;)
        (i32.ne
          (local.get 1)
          (i32.const 0))))
    (call 0
      (i32.const 0))
    (unreachable))
  (func (;3;) (type 0) (param i32)
    (global.set 0
      (local.get 0)))
  (func (;4;) (type 2) (result i32)
    (global.get 0))
  (table (;0;) 2 2 funcref)
  (memory (;0;) 258 358)
  (global (;0;) (mut i32) (i32.const 66592))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 2))
  (export "_emscripten_stack_restore" (func 3))
  (export "emscripten_stack_get_current" (func 4))
  (elem (;0;) (i32.const 1) func 1))
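
The module above is the same-address variant; the different-address variant is not included in this report. One way to derive a comparable variant is sketched below, under the assumption that the two differ only in the store address. The target address 1044, the function name `make_diff_variant`, and the output file name are hypothetical choices, not taken from the original test.

```shell
#!/bin/sh
# Assumption: the "different addresses" variant differs only in the store address.
# 1044 is a hypothetical address; the variant actually measured is not shown here.
make_diff_variant() {
  # Rewrite only the first "(i32.const 1040)", which is the store address.
  # The later occurrence (the load address) is left unchanged.
  awk '!n && sub(/\(i32\.const 1040\)/, "(i32.const 1044)") { n = 1 } { print }' "$1"
}

# Only runs when the reproducer is present in the current directory.
if [ -f test_case.wat ]; then
  make_diff_variant test_case.wat > test_case_diff.wat
fi
```

The resulting file can be converted with wat2wasm and timed in the same way as test_case.wat.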

Your environment

All runtimes were built in release mode and run in JIT mode.

wasmer: 6.0.1
wasmtime: 35.0.0 (9c2e6f17c 2025-06-17)
wasmedge: 0.15.0-alpha.4-5-g7491f8c7
WAMR: iwasm 2.4.0
wabt: 1.0.27
llvm: 18.1.8
Host OS: Ubuntu 22.04.5 LTS x64
CPU: 12th Gen Intel® Core™ i7-12700 × 20

Steps to reproduce

wat2wasm test_case.wat -o test_case.wasm

# Execute the wasm file and collect data
perf stat -r 10 -e 'task-clock' /path/to/wasmer run test_case.wasm
perf stat -r 10 -e 'task-clock' /path/to/wasmer run test_case.wasm --llvm
perf stat -r 10 -e 'task-clock' /path/to/wasmtime test_case.wasm
perf stat -r 10 -e 'task-clock' /path/to/wasmedge --enable-jit test_case.wasm
perf stat -r 10 -e 'task-clock' /path/to/build_fast_jit/iwasm test_case.wasm
perf stat -r 10 -e 'task-clock' /path/to/build_llvm_jit/iwasm test_case.wasm

Expected and actual behavior

The data above suggests that the slowdown is tied to how Wasmer (Cranelift) and WAMR Fast JIT compile read-modify-write sequences.

My guess is that these backends either miss an optimization for the read-modify-write pattern or generate an inefficient memory access sequence for it.

Extra Info

I also submitted a related issue to wasmer regarding this phenomenon. If you need any other relevant information, please let me know and I will do my best to provide it. Looking forward to your reply! Thank you!

Sep 11 '25 07:09 gaaraw

You can build with WAMR_BUILD_FAST_JIT_DUMP to see how the generated assembly differs between the same-address and different-address cases. I suspect it is a register allocation issue.
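
A build with the dump enabled might look like the sketch below. It assumes the standard Linux iwasm build directory product-mini/platforms/linux inside a wasm-micro-runtime checkout; the function name, the build directory name build_fast_jit_dump, and the output file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: configure and build iwasm with Fast JIT and its dump enabled.
# $1 is the root of a wasm-micro-runtime checkout.
build_fast_jit_dump() {
  src="$1/product-mini/platforms/linux"
  if [ ! -d "$src" ]; then
    echo "not a wasm-micro-runtime checkout: $1" >&2
    return 1
  fi
  cmake -S "$src" -B "$1/build_fast_jit_dump" \
        -DWAMR_BUILD_FAST_JIT=1 \
        -DWAMR_BUILD_FAST_JIT_DUMP=1 &&
  cmake --build "$1/build_fast_jit_dump"
}

# Usage (paths are placeholders):
#   build_fast_jit_dump /path/to/wasm-micro-runtime
#   /path/to/wasm-micro-runtime/build_fast_jit_dump/iwasm test_case.wasm > fast_jit_dump.txt 2>&1
```

Running each variant once and saving the output to a separate file makes the two dumps easy to compare afterwards.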

Sep 12 '25 07:09 TianlongLiang

Thank you for your reply! Following your suggestion, I rebuilt with the dump enabled, ran the wasm file in the terminal, and saved the full output to text files.

asm_code_same.txt

asm_code_diff.txt

Here is a quick comparison of the generated assembly code for the two cases:

Case 1: Same address (load/store to the same memory location)

Excerpt from asm_code_same.txt:

mov eax, [r8+r9*1]   ; load mem[addr]
inc eax              ; add 1
mov r9, 0x9EB510     ; reload constant
mov [r8+r9*1], eax   ; store back

The code expands into a load -> increment -> store sequence.
An extra register (r9) is reloaded with the address constant on every iteration.
The JIT does not emit a single read-modify-write instruction such as inc dword ptr [mem].
Repeated over a very large number of iterations, this longer sequence would explain the severe slowdown.

Case 2: Different addresses (load from one, store to another)

Excerpt from asm_code_diff.txt:

mov eax, [rbp+0x50]   ; load from addr1
...
mov [r8+0x1D0], eax   ; store to addr2

Load and store are kept independent.
No redundant address reload, simpler data flow.
Performance is normal and consistent with other runtimes.
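
To compare the complete dumps rather than short excerpts, the two files can be diffed after masking hex literals, since absolute addresses differ between runs and would otherwise dominate the diff. This is a minimal sketch; the helper names `normalize_asm` and `compare_dumps` are hypothetical, and the masking is deliberately coarse (it hides every hex constant, including offsets).

```shell
#!/bin/sh
# Hypothetical helper: replace every hex literal with a fixed token so that
# differing absolute addresses do not drown out real instruction differences.
normalize_asm() {
  sed -E 's/0[xX][0-9a-fA-F]+/0xN/g' "$1"
}

# Diff two dumps after normalization. Returns 0 if they match, 1 if they differ.
compare_dumps() {
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
  normalize_asm "$1" > "$tmp_a"
  normalize_asm "$2" > "$tmp_b"
  diff -u "$tmp_a" "$tmp_b"
  status=$?
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}

# Only runs when both dump files are present in the current directory.
if [ -f asm_code_same.txt ] && [ -f asm_code_diff.txt ]; then
  compare_dumps asm_code_same.txt asm_code_diff.txt
fi
```

Lines that survive the masking are the places where the instruction sequence or register usage really differs between the two cases.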

Sep 12 '25 08:09 gaaraw