Poor performance of wasmtime file I/O, maybe because of tokio
Test Case
test.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h> /* S_IRUSR, S_IWUSR */

int main()
{
    FILE *fp;
    char str[14];
    /* Reopen the file 999 times; each pass does 1000 seek+write pairs. */
    for (int i = 1; i < 1000; i++)
    {
        int fd = open("test.txt", O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR);
        for (int j = 0; j < 1000; j++)
        {
            lseek(fd, i * 10 + j * 13, SEEK_SET);
            write(fd, "hello world! ", 13);
        }
        /* Flush to disk before closing. */
        fsync(fd);
        close(fd);
    }
    return 0;
}
Steps to Reproduce
- First, compile test.c from the test case above into WASM bytecode using the WASI SDK.
wasi-sdk/bin/clang -O3 test.c -o test.wasm
- Second, AOT-compile it with wasmtime to get test.aot.
wasmtime compile -W simd,relaxed-simd test.wasm -o test.aot
- Third, time the test case. It takes about 40 seconds.
time wasmtime run --allow-precompiled --dir ./ test.aot
Expected Results
wasmtime takes about the same time as native and wamr.
Actual Results
Wasmtime takes about 23 seconds. The same test.c takes only about 2 seconds when run natively or with wamr.
Versions and Environment
Wasmtime version: 16.0.0
Operating system: ubuntu 20.04
Architecture: aarch64 (same as x86 for this case)
Extra Info
Profile
# profile for wasmtime
perf record -g -k mono wasmtime run --profile=jitdump --allow-precompiled --dir ./ test.aot
sudo perf inject --jit --input perf.data --output perf.jit.data
perf report -i perf.jit.data
As shown in the following figure, most performance hotspots are on Tokio. This is because wasmtime uses Tokio to implement the file I/O interface, involving:
__imported_wasi_snapshot_preview1_fd_read
__imported_wasi_snapshot_preview1_fd_seek
__imported_wasi_snapshot_preview1_fd_sync
__imported_wasi_snapshot_preview1_fd_write
# profile for native
perf record -g -k mono ./test
perf report
System Call Times Statistics
As shown in the following figure, wasmtime makes three times as many system calls as native. Is that because wasmtime implements file I/O on top of tokio and therefore performs three times as many file I/O operations as native, resulting in poor performance?
# strace for wasmtime
strace -c wasmtime run --allow-precompiled --dir ./ test.aot
# strace for native (the same as wamr)
strace -c ./test
Why do we use Tokio to implement file I/O? Have we considered performance?
Thanks for this detailed report!
The short answer is, we haven't benchmarked performance of file IO yet, and right off the top of my head we have a couple optimization ideas we haven't explored because we were just trying to get things out the door.
I will dig in deeper and see if we can come up with some improvements here.
Thank you for your prompt response. Besides tokio, does wasmtime have any other I/O solutions?
Or is there any configuration on wasmtime that can improve tokio's io performance a bit?
Those are great questions. This is a long answer but this is a pretty involved topic, so please excuse this wall of text, and let me know if there is anything about this I should explain better.
We chose to use tokio for wasmtime-wasi's implementation partly because for many production embedders (i.e. not users of the wasmtime-cli binary, but Fastly, Fermyon, Cosmonic, etc., who use the wasmtime crate directly in their server software), wasmtime and WASI need to be implemented on top of an async runtime, and tokio is the async ecosystem of choice for all of our known embedders.
Another significant factor is that wasmtime-wasi-http is built on an HTTP implementation and needs to integrate with the pollables and streams provided by wasmtime-wasi. For production HTTP stacks in Rust, hyper, which is built on tokio, is essentially the only option. A significant fraction of all internet traffic passes through hyper today - I'm not aware of any credible alternatives.
In order to fit in with the same pollable and stream resources as wasi-sockets, wasi-http, etc., wasi-filesystem is also implemented on top of tokio. Because Linux doesn't provide a non-blocking way to do File IO (except for io_uring, which afaik no production-ready Rust systems are using yet, though hopefully they will soon), the correct way to do File IO in tokio is to move that work onto a special blocking IO thread-pool managed by tokio. That costs two synchronizations between threads per blocking operation: one for the executor to hand the work off from an async task, and one to get notified of completion and resume the async task. That is our hypothesis for where the futex and epoll_pwait syscalls are coming from.
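To make the hand-off cost concrete, here is a minimal std-only sketch. It is not tokio's implementation: `BlockingPool` and `fake_write` are hypothetical names, and a single worker thread with channels stands in for the blocking thread-pool. It only illustrates that every pooled operation pays one synchronization to wake the worker and one to receive the result, where a direct call pays none.

```rust
use std::sync::mpsc;
use std::thread;

/// A unit of blocking work plus a channel to report its result on.
type Job = (Box<dyn FnOnce() -> usize + Send>, mpsc::Sender<usize>);

/// Hypothetical stand-in for a blocking thread pool with a single worker.
struct BlockingPool {
    jobs: mpsc::Sender<Job>,
}

impl BlockingPool {
    fn new() -> Self {
        let (jobs, rx) = mpsc::channel::<Job>();
        thread::spawn(move || {
            // The worker runs each job and sends the result back.
            for (work, done) in rx {
                let _ = done.send(work());
            }
        });
        BlockingPool { jobs }
    }

    /// Hand-off 1: wake the worker. Hand-off 2: wait for its reply.
    fn run<F>(&self, work: F) -> usize
    where
        F: FnOnce() -> usize + Send + 'static,
    {
        let (done_tx, done_rx) = mpsc::channel();
        self.jobs.send((Box::new(work), done_tx)).expect("worker alive");
        done_rx.recv().expect("worker replied")
    }
}

/// Stand-in for a blocking syscall such as write(2).
fn fake_write(buf: &[u8]) -> usize {
    buf.len()
}

fn main() {
    let pool = BlockingPool::new();
    // Direct call: no cross-thread synchronization.
    let direct = fake_write(b"hello world! ");
    // Pooled call: same result, but two thread hand-offs per operation.
    let pooled = pool.run(|| fake_write(b"hello world! "));
    println!("direct={direct} pooled={pooled}");
}
```

With a million small writes, as in the test case above, those two hand-offs per operation are paid a million times, which is consistent with the futex-heavy strace output.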
I don't actually know where the extra write syscalls are coming from, but I haven't had time to dig in deeper yet. There may be some obvious low-hanging fruit there. If we could cut it down to just one write per WASI file blocking-write-and-flush, that would be a nice win.
Finally, why does wasmtime-cli, which is a totally synchronous Rust program, use tokio under the hood to implement the WASI interface? Basically, because we only had time to implement wasmtime-wasi once. When you use wasmtime-wasi with a synchronous wasmtime (i.e. wasmtime::Config::async_support(false), which is the default), wasmtime-wasi provides the Linker with a shim that lazily creates a private tokio runtime, and calls the async Rust implementations of all the pollable and stream bits (and various other wasi-filesystem operations that turn into blocking-syscalls) underneath that private tokio.
There is no configuration available to change whether blocking file IO is moved to a separate thread or not - it's very fundamental to how tokio and the wasmtime-wasi implementation work. The only way to change that behavior would be to rewrite wasmtime-wasi with a completely different internal architecture, to solely use synchronous Rust.
A rewrite of wasmtime-wasi for the synchronous register of Rust would take a significant amount of time to write and maintain, and it would be challenging to integrate with the rest of the ecosystem - it would essentially cleave the wasmtime ecosystem in two. For example, wasmtime-wasi-http, or whatever other crates folks write beyond this tree that integrate with wasmtime-wasi - and that's virtually every interface that does any sort of IO, since streams and pollables are so fundamental - currently use wasmtime-wasi's pub trait Subscribe, pub trait HostInputStream, and pub trait HostOutputStream, which all use async Rust functions and assume a tokio executor. We don't expect that problem to change anytime soon, and in fact we may expect it to become even more deeply integrated with wasmtime as the Component Model works on having native async as part of the next big effort.
To give historical context: I designed wasi-common so that tokio was an optional dependency, and we were able to get away with it during WASI Preview 1 because WASI didn't yet have streams or an extensible poll interface. Our experience embedding wasi-common in various contexts where we needed to wait on e.g. HTTP request readiness, or treat HTTP bodies as a stream, highlighted that the design of WASI Preview 1's poll_oneoff was not the right design for composable systems. So, we changed the design of WASI itself in Preview 2 to have a pollable resource that can be created by any interface that needs to express waiting on readiness, and with that we more or less got forced into using async Rust for async trait Subscribe as the host implementation of each pollable.
We may be able to find a way, for synchronous embeddings, to break some of tokio's rules about performing blocking syscalls on the "main thread", because that operation should only affect wasmtime-wasi and other crates that build on top of it. If we could break those rules, we could provide a faster path to perform some blocking file IO operations (likely input-stream.blocking-read and output-stream.blocking-write-and-flush only) without the cost of context switches. However, that's just a guess we have, and I am apprehensive to break tokio's rules because the consequences of that for other crates (like wasmtime-wasi-http, as well as whatever other code that others implement outside this tree) could both be pretty severe, and difficult to understand or document. We will see if we can investigate that more, but to be totally honest it's not at the top of my priorities right now. If you can help us understand the business case for improving performance, that may help us move it up in priority.
Thank you very much for explaining the reasons for choosing tokio in such detail.
Currently, I am using FlashDB as my database. FlashDB has two types of file APIs: the LIBC file API (like fopen/fread/fwrite/fclose) and the POSIX file API (like open/read/write/close). After being compiled by the wasi-sdk, both types of file API invoke the same set of interfaces:
__imported_wasi_snapshot_preview1_fd_read
__imported_wasi_snapshot_preview1_fd_seek
__imported_wasi_snapshot_preview1_fd_sync
__imported_wasi_snapshot_preview1_fd_write
... and so on.
For the FlashDB scenario, I expect the file IO performance of wasmtime to be close to native, so I plan to try modifying wasmtime's file IO implementation to use Rust's file IO directly instead of going through tokio. This is just an experiment, and I very much look forward to your suggestions.
@pchickey: Just to add to the users of async wasmtime at scale, Microsoft also uses this, and has built it into the containerd/runwasi project for generalized Kubernetes usage as well. Great description of how this came about, thank you for the time and effort.
@liutao-liu Thanks for your response, that use case makes sense and it's one that many users might encounter, whether using FlashDB or sqlite etc.
One detail I glossed over: we are using cap_std::fs::File (a thin wrapper on std File) to actually perform the io operations, not tokio::fs::File, but we are doing so inside a tokio::task::spawn_blocking, which makes it morally equivalent to the way tokio File wraps std File. The trick to avoiding the thread synchronizations is getting rid of spawn_blocking, which is architecturally kinda tricky to avoid, but it may be possible to hack through it, especially if you only need wasmtime-cli to work and not the more difficult composition scenarios I outlined.
Hi @pchickey, thank you for your modification tips. After I made the following changes according to your tips, the performance of wasmtime has been greatly improved.
In the above test case, the wasmtime time has been reduced from 23 seconds to 6 seconds. As you might expect, the futex system call is gone. Also, the write and read system calls have become pwrite64 and pread64; I haven't figured out why yet.
Is it necessary to submit a PR for my changes? I added a run option to control whether or not to block in tokio. I think it is still necessary to provide an option for users to choose whether to use tokio.
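On the pwrite64/pread64 observation: one plausible explanation, which is an assumption and not confirmed anywhere in this thread, is that the host tracks the file position itself and issues positional I/O. In Rust's standard library on Linux, the `FileExt` methods `write_at`/`read_at` go through the pwrite/pread family rather than lseek plus write/read. A minimal sketch (`write_record` and the temp file name are made up for illustration):

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt;

/// Write `data` at `offset` without touching the file cursor.
/// On Linux this goes through pwrite(2) rather than lseek(2) + write(2).
fn write_record(path: &str, offset: u64, data: &[u8]) -> io::Result<()> {
    let file = OpenOptions::new().create(true).write(true).open(path)?;
    file.write_all_at(data, offset)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("positional_write_demo.txt");
    let path = path.to_str().expect("utf-8 temp path");
    write_record(path, 13, b"hello world! ")?;
    // Report the resulting file length, then clean up.
    println!("len = {}", fs::metadata(path)?.len());
    fs::remove_file(path)
}
```

Running this under strace should show pwrite64 instead of an lseek followed by write, which matches the syscall names reported above.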
Thanks for testing that out @liutao-liu! I think it'd be reasonable to land something along these lines into the CLI itself, although I'd personally prefer to avoid a flag here since it'd be best to have the behavior turned on by default. What I might propose is something like:
- Add a new flag to WasiCtx that indicates whether blocking the current thread is ok. This defaults to false but the CLI would set it to true. That would then be consulted during spawn_blocking to do what you're doing, but from configuration via WasiCtx instead.
- For the second slowdown, avoiding in_tokio, I think the best solution would be to do that as part of the CLI. For example the CLI could call in_tokio originally and that way all of wasm is executed within a tokio context. I believe that would hit the fast path in the in_tokio function where nothing is done.
Would that work for you?
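The first bullet can be sketched roughly as follows. This is not the real WasiCtx API: `Ctx`, `allow_blocking_current_thread`, and `run_blocking` are hypothetical names, and a plain `thread::spawn` stands in for tokio's `spawn_blocking`. The point is only the shape of the idea, a per-context flag that picks between running blocking work inline and handing it to another thread.

```rust
use std::thread;

/// Hypothetical context type; the field name is illustrative only.
struct Ctx {
    allow_blocking_current_thread: bool,
}

impl Ctx {
    /// Run blocking work inline when allowed, else on a separate thread.
    fn run_blocking<T, F>(&self, work: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        if self.allow_blocking_current_thread {
            // Fast path: no hand-off, the caller's thread blocks.
            work()
        } else {
            // Slow path: stands in for a blocking thread pool.
            thread::spawn(work).join().expect("blocking task panicked")
        }
    }
}

fn main() {
    let cli = Ctx { allow_blocking_current_thread: true };
    let embedder = Ctx { allow_blocking_current_thread: false };
    let caller = thread::current().id();
    // Only the CLI-style context runs the closure on the calling thread.
    println!("{}", cli.run_blocking(move || thread::current().id() == caller));
    println!("{}", embedder.run_blocking(move || thread::current().id() == caller));
}
```

Defaulting the flag to false keeps async embedders on the safe path, while a synchronous program such as the CLI can opt in.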
I understand your first proposal; I can move the flag to WasiCtx. That's a good idea, it would be simpler.
But I don't understand your second proposal. Do you mean running the whole of wasmtime inside tokio? Can you explain that in more detail?
For the second point, sorry I think this is actually the right function, namely with_ambient_tokio_runtime.
Can you try wrapping this invocation of Func::call in that function and see if it improves the performance you're seeing?
Hello @alexcrichton, I tested the solution you proposed and it took an average of 14 seconds. Compared with the original (23 seconds), this improvement is not ideal. This is because each I/O operation still goes through spawn_blocking, which still causes a large number of asynchronous waits.
Sorry, but to confirm, did you keep the changes mentioned above, e.g. the --block-file-io-in-tokio flag?
I sketched out what those changes might look like in this commit, but I think we'll both want to skip the spawn_blocking (conditionally) and additionally have the with_ambient_tokio_runtime bits.
Can you confirm whether that commit has the performance that you're looking for?
With these two changes, it takes 6 seconds, hardly any further improvement.
Also, the new flag you mentioned adding to WasiCtx can't be read here, in in_tokio.
Oops sorry I forgot to actually turn the option on. I do realize that the goal is to avoid spawn_blocking, and most of the patch I linked was doing that. If you try this branch's latest commit that should remove the spawn_blocking. Can you test and see if that performance is what you're looking for?
Using your latest commit, it took 7 seconds. You haven't changed the code here. After I modified it based on yours, it took 6 seconds.
fn in_tokio<F: Future>(future: F) -> Result<F::Output> {
    Ok(futures::executor::block_on(future))
}
Ok, thanks for confirming!
You're right that I didn't change in_tokio and that was intentional. Changing in_tokio would break other functionality, even on an opt-in basis, so we can't quite so easily switch to a different executor and assume that it works. Some users of in_tokio require the tokio event loop to get turned, and other users don't necessarily need it. To use a totally different executor we'd have to classify which is which.
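To illustrate why swapping the executor is not automatically safe, here is a minimal std-only `block_on`. It is a sketch and is neither the futures crate's nor tokio's implementation; `ThreadWaker` is a name made up here. It polls one future and parks the thread until woken, and it turns no timer or I/O event loop of its own.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread blocked in `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor: polls one future, parking between polls.
/// It drives no timers and no I/O event loop of its own.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleeps until some other thread calls `wake`.
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    // A future that needs no external event source completes fine.
    let n = block_on(async { 6 * 7 });
    println!("{n}");
}
```

A future that only completes when some other event loop makes progress would sit in `thread::park()` here unless something else turns that loop, which is the classification problem described above.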
I've opened https://github.com/bytecodealliance/wasmtime/pull/8190 with the changes I made above cleaned up a bit. That's probably at least a good new baseline to start from in terms of optimizing.
Hi @pchickey,
I'm still working on optimizing wasmtime's fd_read/fd_write performance, and I've found BorrowChecker to be a performance hotspot, at around 7% CPU. I looked at the commit history of crates/wiggle and couldn't see why BorrowChecker was introduced. Do you know the historical background to its introduction?
Which practical application scenarios require BorrowChecker, and does wasmtime-cli require it as well?
Wasmtime's borrow checker is presumably modeled after Rust's borrow checker, right? Rust's borrow checker runs only in the compiler, during static analysis, so it has no negative impact on runtime performance. Wasmtime's borrow checker runs at runtime, which can hurt performance. I don't understand why wasmtime doesn't allow the host to hold a shared borrow and a mut borrow of guest memory at the same time. I feel this could be relaxed as in C/C++, with .wasm ensuring memory safety.
Issue #734 has some explanation of the background of the borrow checker. Which proposal is the one mentioned in that issue?
wasm memory sandboxing is not really related to BorrowChecker - BorrowChecker ensures that regardless of what a wasm program may provide to Wasmtime as hostcall arguments, Wasmtime cannot be caused to violate Rust safety rules. imagine if an embedder exposed a host-optimized memcpy(dest, src, len); if a wasm program called that with overlapping dest and src, implementing such a function by just forwarding the arguments to libc's memcpy() would yield UB in Wasmtime, even though the wasm guest was safely sandboxed from directly accessing memory outside its heap.
(BorrowChecker also ensures references are properly aligned and lie entirely inside the wasm memory they ought to be in, which are checks even a C/C++ embedder of wasmtime would need and would cause some overhead. this is all pretty similar to the kinds of checks pread(2) and similar do before using user inputs)
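The memcpy example can be sketched in safe Rust as below. `host_memcpy` and `CopyError` are hypothetical names and this is not Wasmtime or wiggle code; it only shows the kind of validation a host has to perform on guest-supplied offsets (bounds and overlap) before it can treat them as non-aliasing ranges.

```rust
/// Errors a host-side copy can report back to the guest.
#[derive(Debug, PartialEq)]
enum CopyError {
    OutOfBounds,
    Overlap,
}

/// Hypothetical hostcall: copy `len` bytes inside guest memory.
/// `dest` and `src` are guest-supplied offsets and are untrusted.
fn host_memcpy(mem: &mut [u8], dest: usize, src: usize, len: usize) -> Result<(), CopyError> {
    let dest_end = dest.checked_add(len).ok_or(CopyError::OutOfBounds)?;
    let src_end = src.checked_add(len).ok_or(CopyError::OutOfBounds)?;
    if dest_end > mem.len() || src_end > mem.len() {
        return Err(CopyError::OutOfBounds);
    }
    // memcpy semantics forbid overlap, so reject instead of invoking UB.
    if len > 0 && dest < src_end && src < dest_end {
        return Err(CopyError::Overlap);
    }
    // Safe API with memmove semantics; the ranges were validated above.
    mem.copy_within(src..src_end, dest);
    Ok(())
}

fn main() {
    let mut mem = *b"abcdefgh";
    println!("{:?}", host_memcpy(&mut mem, 4, 0, 4)); // disjoint ranges
    println!("{:?}", host_memcpy(&mut mem, 2, 0, 4)); // overlapping ranges
    println!("{:?}", host_memcpy(&mut mem, 6, 0, 4)); // runs past the end
}
```

The guest stays sandboxed in all three calls; the checks exist to protect the host's own invariants.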
while it might be possible to defer these checks in some cases (passthrough to a host pread(2) is probably a good example..) it's still tricky in Wasmtime because we don't know that pread(2) isn't happening at the end of a call chain like:
wasm -> embedder_hostcall(buf, len) -> BorrowChecker in wasmtime tracks &[buf] -> call back into wasm -> wasm calls wasi-common's fd_read(buf, len). so from fd_read's perspective it's only safe to pass through arguments to the host OS if Wasmtime has no other borrow overlapping with the buffer to be read into.
iirc the proposal in #734 is basically what became BorrowChecker and BorrowChecker is still what we plan to use even as wig/wit/witx has evolved, though Pat or Alex would definitely be more familiar with that.
re. simultaneously holding share borrow and mut borrow of guest memory, BorrowChecker exists partially to support that, erroring if those borrows would overlap. disallowing concurrent borrows of guest memory would make BorrowChecker much simpler and faster, but a much more limited host interface :sweat_smile:
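As a rough picture of what such runtime tracking involves, here is a toy tracker. It is not wiggle's actual data structure: `Region` and `Borrows` are made-up names. It allows overlapping shared borrows and rejects any overlap that involves an exclusive borrow. Every new borrow scans the live ones, which hints at where a per-hostcall cost can come from.

```rust
/// Half-open byte range [start, end) in guest memory.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Region {
    start: u32,
    end: u32,
}

impl Region {
    fn overlaps(self, other: Region) -> bool {
        self.start < other.end && other.start < self.end
    }
}

/// Toy runtime borrow tracker; not wiggle's actual data structure.
#[derive(Default)]
struct Borrows {
    shared: Vec<Region>,
    exclusive: Vec<Region>,
}

impl Borrows {
    /// Shared borrows may overlap each other, but not an exclusive one.
    fn borrow_shared(&mut self, r: Region) -> Result<(), &'static str> {
        if self.exclusive.iter().any(|m| m.overlaps(r)) {
            return Err("overlaps an exclusive borrow");
        }
        self.shared.push(r);
        Ok(())
    }

    /// Exclusive borrows may not overlap any live borrow.
    fn borrow_mut(&mut self, r: Region) -> Result<(), &'static str> {
        if self.shared.iter().chain(&self.exclusive).any(|b| b.overlaps(r)) {
            return Err("overlaps a live borrow");
        }
        self.exclusive.push(r);
        Ok(())
    }
}

fn main() {
    let mut b = Borrows::default();
    println!("{:?}", b.borrow_shared(Region { start: 0, end: 16 }));
    println!("{:?}", b.borrow_shared(Region { start: 8, end: 24 }));
    println!("{:?}", b.borrow_mut(Region { start: 32, end: 48 }));
    println!("{:?}", b.borrow_mut(Region { start: 12, end: 20 }));
}
```

Only the last call fails, because it is the only exclusive borrow that overlaps a live one.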
The BorrowChecker type is quite old in Wasmtime and I don't have anything to add about its rationale over what @iximeow already said. That being said I believe it's over-powered relative to what we need it to do, hence the cost you're seeing, so I posted https://github.com/bytecodealliance/wasmtime/pull/8277 which should remove the performance issue you're seeing related to BorrowChecker. That comes at a cost of redefining what it does, but I believe that should be ok given how it's used today.