wasmtime
Call stack performance investigation
I am running the Ackermann benchmark with wasmtime, and I noticed a performance gap of approximately 30% compared with native. Profiling with VTune, I see that the wasmtime disassembly contains a lot of instructions for setting up and tearing down the function's call stack at the beginning and end of the function, while the native build (clang, -O3) does not.
I used wasmtime explore to correlate the wat with the disassembly as well. Here are the relevant snippets of the disassembly -
Wasm Setup of the stack -
Address Source Line Assembly
0x7f2b691d8040 0 push rbp
0x7f2b691d8041 0 mov rbp, rsp
0x7f2b691d8044 0 mov r10, qword ptr [rdi+0x8]
0x7f2b691d8048 0 mov r10, qword ptr [r10]
0x7f2b691d804b 0 cmp r10, rsp
0x7f2b691d804e 0 jnbe 0x7f2b691d80b7 <Block 9>
0x7f2b691d8054 0 Block 2:
0x7f2b691d8054 0 sub rsp, 0x10
0x7f2b691d8058 0 mov qword ptr [rsp], r12
0x7f2b691d805c 0 mov qword ptr [rsp+0x8], r15
0x7f2b691d8061 0 mov r15, rdi
0x7f2b691d8064 0 test edx, edx
0x7f2b691d8066 0 mov r12, rdx
0x7f2b691d8069 0 jz 0x7f2b691d80a2 <Block 8>
I have pasted the wat file of this function below as well for reference.
Wasm Teardown -
Address Source Line Assembly
0x7f2b691d80a2 0 lea eax, ptr [rcx+0x1]
0x7f2b691d80a5 0 mov r12, qword ptr [rsp]
0x7f2b691d80a9 0 mov r15, qword ptr [rsp+0x8]
0x7f2b691d80ae 0 add rsp, 0x10
0x7f2b691d80b2 0 mov rsp, rbp
0x7f2b691d80b5 0 pop rbp
0x7f2b691d80b6 0 ret
wat of relevant function -
(func (;3;) (type 5) (param i32 i32) (result i32)
  local.get 0
  if ;; label = @1
    loop ;; label = @2
      local.get 1
      if (result i32) ;; label = @3
        local.get 0
        local.get 1
        i32.const 1
        i32.sub
        call 3
      else
        i32.const 1
      end
      local.set 1
      local.get 0
      i32.const 1
      i32.sub
      local.tee 0
      br_if 0 (;@2;)
    end
  end
  local.get 1
  i32.const 1
  i32.add
)
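Read back into C, the wat above appears to correspond to the loop form sketched below: the two tail-position calls in the C source further down have become a loop, and only the inner `ackermann(M, N - 1)` call remains as a real call. This is a hand translation for illustration, not compiler output, and the name `ackermann_loop` is hypothetical.

```c
/* Hand-written C equivalent of the wat function above (an illustrative
   assumption, not generated code). Local 0 is m, local 1 is n. */
static int ackermann_loop(int m, int n)
{
    if (m != 0) {                           /* local.get 0; if */
        do {                                /* loop */
            if (n != 0) {
                n = ackermann_loop(m, n - 1);   /* call 3 */
            } else {
                n = 1;                      /* i32.const 1 */
            }
            m = m - 1;                      /* i32.sub; local.tee 0 */
        } while (m != 0);                   /* br_if 0 */
    }
    return n + 1;                           /* i32.add */
}
```

Note that `m` is read again after the recursive call returns, so it has to survive the call in a callee-saved register or a stack slot.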
The native disassembly is pretty short; the entire function is shown below (this is in AT&T syntax, unlike the Intel syntax in the snippets above) -
Address Source Line
0x1170 0 Block 1:
0x1170 0 pushq %rbx
0x1171 0 mov %esi, %eax
0x1173 0 test %edi, %edi
0x1175 0 jz 0x119f <Block 8>
0x1177 0 Block 2:
0x1177 0 mov %edi, %ebx
0x1179 0 jmp 0x118a <Block 5>
0x117b 0 Block 3:
0x117b 0 nopl %eax, (%rax,%rax,1)
0x1180 0 Block 4:
0x1180 0 mov $0x1, %eax
0x1185 0 add $0xffffffff, %ebx
0x1188 0 jz 0x119f <Block 8>
0x118a 0 Block 5:
0x118a 0 test %eax, %eax
0x118c 0 jz 0x1180 <Block 4>
0x118e 0 Block 6:
0x118e 0 add $0xffffffff, %eax
0x1191 0 mov %ebx, %edi
0x1193 0 mov %eax, %esi
0x1195 0 callq 0x1170 <Block 1>
0x119a 0 Block 7:
0x119a 0 add $0xffffffff, %ebx
0x119d 0 jnz 0x118a <Block 5>
0x119f 0 Block 8:
0x119f 0 add $0x1, %eax
0x11a2 0 popq %rbx
0x11a3 0 retq
And the C source function used to generate both the wasm and the native build is -
int ackermann(int M, int N)
{
    if (M == 0)
    {
        return N + 1;
    }
    if (N == 0)
    {
        return ackermann(M - 1, 1);
    }
    return ackermann(M - 1, ackermann(M, (N - 1)));
}
I also tried the --wasm-features tail-call CLI flag; however, that actually made the performance slightly worse.
Any pointers on the difference in disassembly between native and wasm?
Hi @rahulchaphalkar -- it looks like the difference is down to two fundamental factors:
- We have explicit stack checks rather than implicit stack probes and reliance on guard pages. We've actually just been discussing this in #8135. That's the business with `r10` before decrementing `rsp`.
- We have two clobber-saves (`r12` and `r15`), whereas the native code gets away with one (`rbx`). It would be a good exercise to trace through the assembly and see what the registers are used for; perhaps the native compiler's register allocator is able to be a bit smarter about reuse. It is fundamentally necessary to have some state on the stack I think, since there is a recursive call (the one in non-tail position on the second-to-last line of C) and there is at least one word of state (`M`) necessary after it returns.
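To make the first point concrete, here is a minimal C model of what the four instructions after `mov rbp, rsp` in the wasm prologue appear to do. The struct names, field names, and layout are hypothetical stand-ins chosen only to mirror the two loads in the disassembly, and treating `<Block 9>` as the trap path is an assumption; this is not Wasmtime's actual data layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical runtime structures; names and layout are assumptions. */
struct runtime_limits {
    uintptr_t stack_limit;          /* lowest address the stack may reach */
};

struct vm_context {
    uint64_t header;                /* placeholder for whatever sits at offset 0 */
    struct runtime_limits *limits;  /* offset 8 on x86-64, i.e. [rdi+0x8] */
};

/* Returns true when the prologue would take the branch to <Block 9>. */
static bool stack_check_fails(const struct vm_context *ctx, uintptr_t sp)
{
    /* mov r10, [rdi+0x8] ; mov r10, [r10] */
    uintptr_t limit = ctx->limits->stack_limit;
    /* cmp r10, rsp ; jnbe: taken when limit is (unsigned) above sp */
    return limit > sp;
}
```

This costs two dependent loads plus a compare and branch on every call, which a guard-page scheme avoids on the fast path.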
And FWIW, it is known that the tail calling convention can currently lead to some slowdowns, which is why Wasm tail calls aren't enabled by default yet: https://github.com/bytecodealliance/wasmtime/issues/6759