wasmtime Wasmtime horizontal scaling results in poor performance

Wasmtime horizontal scaling results in poor performance

Open Duslia opened this issue 1 year ago • 24 comments

When I try to use tokio to scale wasmtime horizontally, I found that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of execute num is almost the same. This is my performance data:

thread and task	data	total execute nums
thread 1, task 1	11w/task	11w
thread 3, task 3	4w/task	12w
thread 8, task 8	1w/task	8w

This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .

And the wasm code is just a function return 1.

package main

//export echo
func echo(ctxLen, reqLen int32) int32 { //nolint
	return 1
}

func main() {
}

Aug 08 '22 10:08 Duslia

Thanks for the report! Could you share the Wasmtime version you're using as well as the specific *.wasm file that you're benchmarking?

Aug 08 '22 15:08 alexcrichton

That being said what you're almost certainly running into is limits on concurrently running mmap in multiple threads within one process. Allocation of fiber stacks for asynchronously executing WebAsembly primarily goes through mmap right now and unmapping a stack requires broadcasting to other threads with an IPI which slows down concurrent execution.

One improvement that you can do to your benchmark is to use the pooling allocator instead of the on-demand allocator. Otherwise though improvements need to come from the kernel itself (assuming you're running Linux, which would also be great to specify in the issue description). Historically the kernel has prototyped a madvisev syscall which allows batch unmappings that Wasmtime could use with the pooling allocator. We've tested this historically and it's shown to greatly increase the throughput in situations like this. This isn't landed in Linux, however.

Finally it depends on your use case of whether this comes up much in practice. Typically the actual work performed by a wasm module dwarfs the cost of the memory mappings since the wasm modules live for a longer time than "return 1". If that's the case then this microbenchmark isn't necessarily indicative of performance and it's recommended to use a more realistic workload for your use case. If this sort of short-running computation is common for your embedding, however, then multithreaded execution will currently have the impact that you're measuring here.

Aug 08 '22 15:08 alexcrichton

Thanks!! I got it.

Aug 09 '22 07:08 Duslia

I'm trying to comment out these two lines and it can scale horizontally with a high performance. I 'd like to ask that is the function of these lines of code to prevent the previous data from being obtained in the next execution?

Aug 09 '22 12:08 Duslia

Yes, without that call the old data would be preserved when reusing a memory region as linear memory for the wasm module.

Aug 09 '22 13:08 bjorn3

Ah ok that's a good confirmation that madvise and IPIs are indeed likely your scaling bottleneck. In that case there's unfortunately not much that can be done right now that Wasmtime can do. This is some (old) discussion about a possible madvisev syscall which would largely fix this scaling bottleneck within Wasmtime. This was confirmed with a modified kernel with that patch and developing a version of Wasmtime that uses madvisev.

In the meantime though this is something that basically just needs to be taken into account when capacity planning with Wasmtime. If you're willing assistance in getting a patch such as that into the Linux kernel (or investigating other routes of handling this bottleneck) would be greatly appreciated!

Aug 09 '22 14:08 alexcrichton

Oh sorry, and to confirm the madvise call is absolutely critical for safety, it cannot be removed and still have a secure sandbox for wasm.

Aug 09 '22 14:08 alexcrichton

OK. I will think about how to solve it. Thanks a lot!!

Aug 09 '22 15:08 Duslia

I'm wondering if the following 2 solutions may be accepted?

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.
Maintain 2 index groups, one is dirty, the other one is clean. On drop, the index is put into dirty group and try to merge with others, and won't be madvised instantly. On index allocation, try to get from the clean one, and if there's no available in it, try to obtain a continuous range from dirty group and call madvise, then return one of it back and put all the index left inside the range into the clean group. In this solution, we can call less madvise by clean stacks in batch.

Aug 10 '22 02:08 ihciah

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.

As I understand it the IPI is the bottlwneck, not the syscall overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.

Aug 10 '22 06:08 bjorn3

Call madvise with io_uring: It saves syscall but I don't know if it will work to avoid scaling bottleneck.

As I understand it the IPI is the bottlwneck, not the syscall overhead. Every time madvise is run all cores running the wasmtime process have to be temporarily interupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.

Thanks for the explain! I also checked the kernel code and found in func io_madvise of io_uring.c, it just call do_madvise, the lock will be acquired inside do_madvise, which means there's no aggregation with it. To do madvise aggregation without modifying kernel, what about the solution 2? I implemented a demo here(not finished yet): https://github.com/ihciah/wasmtime/commit/071f5d9978d94ee669d72c90bf2d6f2728324303

Aug 10 '22 07:08 ihciah

I might be misunderstanding, but isn't that what the pooling allocator does?

Aug 10 '22 07:08 bjorn3

I might be misunderstanding, but isn't that what the pooling allocator does?

The pool inside the wasmtime allocates a big memory one time in a single call, and manages it by maintaining index of memory pages manually. Each time the memory page(memory_pages, table_pages, stack_pages) dropped, it will clear the memory content by calling decommit->madvise, and mark the index available for the later allocation. And here is the problem. What I want to impl is to make the indices to recycle continues and clear the memory in batch. I think it is a kind aggregation in userspace. (I started reading wasmtime code just 2 from days ago. Maybe there's something I misunderstand..)

Aug 10 '22 08:08 ihciah

Call madvise with io_uring

While we haven't directly experimented with this I think as you've discovered we took a look and concluded it probably wouldn't help.

Maintain 2 index groups, one is dirty, the other one is clean.

This we have actually experimented with historically. Our conclusion was that it did not actually help all that much. We found that the cost of an madvise scaled with the size of the region being advised, which meant that when we batched everything up it took effectively just as long as doing everything individually. We had various hypotheses for why this was the case but we didn't bottom it out 100% at the time.

Our testing with madvisev is mostly what led us to stop investigating this issue because madvisev so clearly fixed the issue relative to all other workarounds.

Aug 10 '22 14:08 alexcrichton

We found that the cost of an madvise scaled with the size of the region being advised

Thank you for sharing past experiments.

But it still confuse me that advising multiple regions with madvisev is better than madvise with a merged region. If this is true, does this means we can impl a better madvise in kernel by simply split the region and do it one by one?

Aug 10 '22 17:08 ihciah

You can't merge the regions to madvise - they're not contiguous.

Aug 10 '22 17:08 pchickey

You can't merge the regions to madvise - they're not contiguous.

I mean maybe we can make the indices as contiguous as possible and then we can do madvise on the biggest contiguous region when alloc. I don't know if this will work.

Aug 10 '22 17:08 ihciah

IIRC we were just as confused as you are. I remember though reorganizing the pooling allocator to put the table/memory/instance allocations all next to each other so each instance's index identified one large contiguous region of memory. In the benchmarks for this repository parallel instantiation actually got worse than the current performance on main. My guess at the time was that it had to do with the very large size of the madvise since madvise got more expensive as the region got larger.

As for kernel improvements that's never something we've looked into beyond madvisev

Aug 10 '22 18:08 alexcrichton

I finished my experiment and result shows the same conclusion. What I did: clear stack pool in batch(https://github.com/ihciah/wasmtime/tree/main) Result: running return 1 demo(on a KVM vm)

Original 1 core: 117 kQPS
Original 3 cores: 66 kQPS * 3
Exp 1 core: 133 kQPS
Exp 3 cores: 52kQPS * 3

Under 1 core, the modified version has slightly better performance(the madvise batch is 1000 times original); but under 3 cores it becomes worse.

My guess at the time was that it had to do with the very large size of the madvise since madvise got more expensive as the region got larger.

Also, I changed stack pool size 1000 to bigger values, size=2048 => 20ms and size=4096 => 45ms, which matches your guess. I haven't looked closely at the madvisev impl yet. Comparing to calling madvise, what does it saves? It still need the same IPI in theory.

Aug 12 '22 03:08 ihciah

Always good to have independent confirmation, thanks for doing that!

As for why madvisev improves the situation, I believe this happens because each call to madvise issues an IPI. With madvisev only one IPI is necessary still. This means that instead of N calls to madvisev and N IPIs there's only one call to madvisev with only one IPI. From the kernel's perspective all the calling process needs is to know that by the time the syscall returns that the address space is as-expected in all threads, but during execution of the syscall there's a lot more flexibility about precisely what all threads see.

Aug 15 '22 14:08 alexcrichton

Thanks for the explanation. But I guess IPI cost(or other bottleneck) is not linear with syscall times but the memory ranges since doing madvise with twice bigger memory a little bit cost more than twice longer time.

It seems there's no easy way to save the madvise cost and avoid its bottleneck for general usage... Since users may control the instance isolation by their own, for example they can make sure only the same tenant and the same wasm binary share instances. Under these circumstances doing madvise maybe unneccessary. Is it acceptable to add an option to disable madvise(madvise can be still enabled by default, and the option can only be touched when some features like "dangerous" enabled)?

Aug 19 '22 07:08 ihciah

Since users may control the instance isolation by their own, for example they can make sure only the same tenant and the same wasm binary share instances. Under these circumstances doing madvise maybe unneccessary.

In that case you could reuse the Instance instead of recreate it every time, right? Not calling madvise is effectively equivalent to reusing the Instance while resetting all tables and wasm globals (actual global variables are in the linear memory and as such not reset), but not the linear memory, in which case you might as well reset nothing. That prevents everything getting out of sync and most observable state isn't reset anyway if you don't do madvise.

Aug 19 '22 07:08 bjorn3

Not calling madvise is effectively equivalent to reusing the Instance while resetting all tables and wasm globals (actual global variables are in the linear memory and as such not reset), but not the linear memory, in which case you might as well reset nothing.

I see. There are 3 callers for decommit, the decommit_memory_pages and decommit_table_pages are with the instance lifecycle. So we may not affect them. The hot path is the third caller decommit_stack_pages. Every call for func.call_async will do it. So we may only disable decommit_stack_pages optionally.

Aug 19 '22 08:08 ihciah

Using madvise to clear linear memory and tables is required for correctness and we can't add a knob to turn that off, but I think it would be reasonable to add a knob for the stacks to avoid madvise since that's purely a defense-in-depth thing and is not required for correctness.

Aug 19 '22 19:08 alexcrichton

The *_keep_resident options added recently should additionally help by reducing page faults and therefore contention on kernel-level locks. There's also work towards optimizing bounds-checking-mode in Cranelift to avoid pagefaults/mprotect even more to reduce contention even further.

Otherwise though this is a known issue and isn't something where there's any easy wins on the table we know of that haven't been applied yet. PRs and more discussion around this are of course always welcome but I'm going to go ahead and close this since there's not a whole lot more that can be done at this time.

Dec 02 '22 00:12 alexcrichton

wasmtime wasmtime copied to clipboard

Wasmtime horizontal scaling results in poor performance

wasmtime
wasmtime copied to clipboard