wasmtime Call stack performance investigation


I am running the Ackermann benchmark with wasmtime, and I noticed a performance gap of approximately 30% compared with native. Profiling with VTune, I see that the wasmtime disassembly contains a lot of instructions for setting up and tearing down the call stack at the beginning and end of the function, while the native build (clang, -O3) does not. I also used wasmtime explore to correlate the wat with the disassembly. Here are the disassembly snippets:

Wasm Setup of the stack -

Address	Source Line	Assembly
0x7f2b691d8040	0	push rbp
0x7f2b691d8041	0	mov rbp, rsp							
0x7f2b691d8044	0	mov r10, qword ptr [rdi+0x8]
0x7f2b691d8048	0	mov r10, qword ptr [r10]
0x7f2b691d804b	0	cmp r10, rsp
0x7f2b691d804e	0	jnbe 0x7f2b691d80b7 <Block 9>							
0x7f2b691d8054	0	Block 2:							
0x7f2b691d8054	0	sub rsp, 0x10							
0x7f2b691d8058	0	mov qword ptr [rsp], r12
0x7f2b691d805c	0	mov qword ptr [rsp+0x8], r15
0x7f2b691d8061	0	mov r15, rdi
0x7f2b691d8064	0	test edx, edx							
0x7f2b691d8066	0	mov r12, rdx
0x7f2b691d8069	0	jz 0x7f2b691d80a2 <Block 8>

I have pasted the wat file of this function below as well for reference.

Wasm Teardown -

Address	Source Line	Assembly
0x7f2b691d80a2	0	lea eax, ptr [rcx+0x1]
0x7f2b691d80a5	0	mov r12, qword ptr [rsp]	
0x7f2b691d80a9	0	mov r15, qword ptr [rsp+0x8]
0x7f2b691d80ae	0	add rsp, 0x10							
0x7f2b691d80b2	0	mov rsp, rbp
0x7f2b691d80b5	0	pop rbp
0x7f2b691d80b6	0	ret

wat of relevant function -

(func (;3;) (type 5) (param i32 i32) (result i32)
    local.get 0
    if ;; label = @1
      loop ;; label = @2
        local.get 1
        if (result i32) ;; label = @3
          local.get 0
          local.get 1
          i32.const 1
          i32.sub
          call 3
        else
          i32.const 1
        end
        local.set 1
        local.get 0
        i32.const 1
        i32.sub
        local.tee 0
        br_if 0 (;@2;)
      end
    end
    local.get 1
    i32.const 1
    i32.add
  )

The native disassembly is pretty short; the entire function is shown below (this is in AT&T syntax, unlike the Intel syntax in the snippets above):

Address	Source Line	Assembly
0x1170	0	Block 1:							
0x1170	0	pushq  %rbx							
0x1171	0	mov %esi, %eax							
0x1173	0	test %edi, %edi							
0x1175	0	jz 0x119f <Block 8>							
0x1177	0	Block 2:							
0x1177	0	mov %edi, %ebx
0x1179	0	jmp 0x118a <Block 5>
0x117b	0	Block 3:							
0x117b	0	nopl  %eax, (%rax,%rax,1)							
0x1180	0	Block 4:							
0x1180	0	mov $0x1, %eax							
0x1185	0	add $0xffffffff, %ebx							
0x1188	0	jz 0x119f <Block 8>							
0x118a	0	Block 5:							
0x118a	0	test %eax, %eax							
0x118c	0	jz 0x1180 <Block 4>							
0x118e	0	Block 6:							
0x118e	0	add $0xffffffff, %eax							
0x1191	0	mov %ebx, %edi
0x1193	0	mov %eax, %esi
0x1195	0	callq  0x1170 <Block 1>
0x119a	0	Block 7:							
0x119a	0	add $0xffffffff, %ebx
0x119d	0	jnz 0x118a <Block 5>							
0x119f	0	Block 8:							
0x119f	0	add $0x1, %eax
0x11a2	0	popq  %rbx
0x11a3	0	retq

The C source function used to generate both the wasm and the native build is:

int ackermann(int M, int N)
{
    if (M == 0)
    {
        return N + 1;
    }
    if (N == 0)
    {
        return ackermann(M - 1, 1);
    }
    return ackermann(M - 1, ackermann(M, (N - 1)));
}
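For readers comparing the listings, here is a rough C rendering of the control flow that the wat above appears to encode: the outer call in tail position has become a loop, and only the inner call remains a real recursive call. This is a sketch derived by hand from the wat, and the name `ackermann_loop` is made up for illustration; it is not compiler output.

```c
/* Hand-written loop form of ackermann(), following the structure of the wat:
 *   if (M != 0) { loop { N = (N != 0) ? call(M, N - 1) : 1; M -= 1; } }
 *   return N + 1;
 * ackermann_loop is a hypothetical name used only for this sketch. */
int ackermann_loop(int m, int n)
{
    while (m != 0)
    {
        /* The only remaining real call: the inner, non-tail recursion. */
        n = (n != 0) ? ackermann_loop(m, n - 1) : 1;
        /* The outer call ackermann(M - 1, ...) is now the next iteration. */
        m -= 1;
    }
    return n + 1;
}
```

Note that `m` is still needed after the inner call returns, so it has to be kept somewhere that survives the call, either in a callee-saved register or on the stack.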

I also tried the --wasm-features tail-call CLI flag; however, that actually made performance slightly worse. Any pointers on the difference in disassembly between native and wasm?

Mar 18 '24 23:03 rahulchaphalkar

Hi @rahulchaphalkar -- it looks like the difference is down to two fundamental factors:

We have explicit stack checks rather than implicit stack probes and reliance on guard pages. We've actually just been discussing this in #8135. That's the business with r10 before decrementing rsp.
We have two clobber-saves (r12 and r15), whereas the native code gets away with one (rbx). It would be a good exercise to trace through the assembly and see what the registers are used for; perhaps the native compiler's register allocator is able to be a bit smarter about reuse. I think it is fundamentally necessary to have some state on the stack, since there is a recursive call (the one in non-tail position on the second-to-last line of the C function) and at least one word of state (M) is needed after it returns.
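To make the first point concrete, the four instructions involving r10 in the prologue can be read as roughly the following C. This is only an illustrative model, assuming a 64-bit target: the struct and field names (`vm_context`, `runtime_limits`, `stack_limit`, `magic`) are invented for this sketch and are not taken from the wasmtime source; the only things taken from the disassembly are the two loads and the unsigned comparison against rsp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runtime structures; names are illustrative. */
struct runtime_limits {
    uintptr_t stack_limit;          /* lowest address rsp is allowed to reach */
};

struct vm_context {
    uint64_t magic;                 /* placeholder for whatever lives at offset 0 */
    struct runtime_limits *limits;  /* pointer at offset 8, as in [rdi+0x8] */
};

/* Models:  mov r10, [rdi+0x8]
 *          mov r10, [r10]
 *          cmp r10, rsp
 *          jnbe <trap block>
 * Returns true when the check would branch to the trap block. */
static bool stack_check_fails(const struct vm_context *ctx, uintptr_t sp)
{
    uintptr_t limit = ctx->limits->stack_limit;  /* the two dependent loads */
    return limit > sp;                           /* jnbe: unsigned "above" */
}
```

The native build has no equivalent sequence in its prologue, which matches the point above about relying on guard pages instead of an explicit comparison on every call.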

Mar 19 '24 02:03 cfallin

And FWIW, it is known that the tail calling convention can currently lead to some slowdowns, which is why Wasm tail calls aren't enabled by default yet: https://github.com/bytecodealliance/wasmtime/issues/6759

Mar 19 '24 15:03 fitzgen
