tokio
`tokio::fs` + async is 1-2 orders of magnitude slower than a blocking version
Version 1.4.0
Platform 64-bit WSL2 Linux: Linux 4.19.104-microsoft-standard #1 SMP x86_64 x86_64 x86_64 GNU/Linux
Description The code is in this repo. The setup is explained in the README.
TL;DR:
- Implement a toy clone of `du -hs` with blocking and async APIs.
- The blocking `std::fs` version is about 35% slower than `du`, not bad.
- An async version that uses `tokio::fs` but processes files sequentially is 64x (!) slower than the blocking version.
- An async version that tries to do as many things concurrently as possible using `FuturesUnordered` and `select!` is 2.5x faster than the sequential version, but still 25x slower than the simple blocking version.
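For context, the blocking baseline boils down to a recursive walk over `symlink_metadata` and `read_dir`. A minimal sketch (not the exact code from the repo, which also handles human-readable output) could look like this:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Recursively sum apparent file sizes under `path`, like a toy `du -s`.
// symlink_metadata is used so that symlinks are not followed.
fn disk_usage(path: &Path) -> io::Result<u64> {
    let meta = fs::symlink_metadata(path)?;
    if meta.is_dir() {
        let mut total = 0;
        for entry in fs::read_dir(path)? {
            total += disk_usage(&entry?.path())?;
        }
        Ok(total)
    } else {
        Ok(meta.len())
    }
}

fn main() -> io::Result<()> {
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".into());
    println!("{} bytes", disk_usage(Path::new(&root))?);
    Ok(())
}
```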
I understand that `tokio::fs` uses `std::fs` under the hood, that there's no non-blocking system API for the FS (modulo `io-uring`, but 🤷♂️), and that async has inherent overhead, especially when the disk cache is hot and there's not much waiting on blocking calls.
However, 25x (not to mention 64x) just feels like too extreme a slowdown, so I wonder:
- if this is actually expected,
- or some `tokio::fs` code needs tuning/optimization,
- or something else entirely (wrong setup?)
I mean, we already know that it is never going to be as fast as using the blocking APIs directly. Did you try with a non-blocking `std::fs::read_dir`?
@Darksonn

> Did you try with a non-blocking `std::fs::read_dir`?

Sorry, not sure what you mean by a non-blocking `std::fs::read_dir`. `std::fs` provides a blocking API.
My setup is described here in detail, with code and perf data.
> I mean, we already know that it is never going to be as fast as using the blocking APIs directly.

There are several things at play here. First there's overhead from async, then from the `tokio::fs` wrappers, but then there is a speed-up from the parallel processing of files in the case of the `async-par` implementation.
In any case, a 25x to 64x slowdown from going to `tokio::fs` + `async` compared to a blocking version is pretty extreme, isn't it? We're talking about the difference between 200ms (feels instant) and 12s (feels like an eternity).
What I meant to suggest was to replace `std::fs::read_dir` in the linked code with `tokio::fs::read_dir`.
It is a big slowdown, and there have been several examples of people building really slow benchmarks and finding some trivial change to their code that yields a massive speedup, but those were all for reading the contents of the files. I think ultimately you are just running into a lot of back-and-forth between a bunch of threads, and that is just expensive.
@Darksonn let me try to clarify. As I explained in the README.md, there are several branches, each with its own implementation:
- The `sync` branch uses `std::fs`; it can be considered the baseline.
- The `async-seq` branch uses `tokio::fs` (incl. `tokio::fs::read_dir` and `tokio::fs::symlink_metadata`) and does the processing sequentially (so the same as `sync`, but with `tokio::fs` and `.await` where necessary). This is 64x slower than `sync`. The numbers are pretty much the same for both single- and multi-threaded runtimes, and the amount of context switching is huge for both runtimes too.
- The `async-par` branch uses `tokio::fs`, but also does as many things concurrently as possible by utilising `FuturesUnordered` and `select!`.
If there is a trivial change to my code that can make, say, the `async-seq` version perform at least within a 2x margin of the `sync` version, I'd be more than happy to learn what it is 🙂
I've done more testing on other platforms:
- On macOS Big Sur the `async-seq` version is ~3x slower than `sync`, but `async-par` is ~15% faster than `sync`.
- On Windows 10 the `async-seq` version is ~2.25x slower than `sync`, but `async-par` is ~2x faster than `sync` (makes sense, since my desktop has more cores than the MBP and can benefit more from `async-par`).
This means that the issue is either Linux-specific (unlikely) or WSL2-specific (seems more likely). I don't have a native Linux box at hand to test this out right now.
I also tried different versions of `rustc` (1.49, 1.50, 1.51) but observed similar behaviour.
I tried running it on my laptop, which is a native Linux box, but `async-par` failed with "too many open files". Here are the others:

```
Benchmark #1: du -hs ~/src
  Time (mean ± σ):   813.7 ms ± 21.5 ms    [User: 249.6 ms, System: 557.6 ms]
  Range (min … max): 785.1 ms … 853.3 ms   10 runs

Benchmark #2: builds/sync ~/src
  Time (mean ± σ):   884.7 ms ± 8.9 ms     [User: 239.9 ms, System: 638.6 ms]
  Range (min … max): 871.0 ms … 896.5 ms   10 runs

Benchmark #3: builds/async-seq ~/src
  Time (mean ± σ):   5.603 s ± 0.059 s     [User: 2.810 s, System: 4.733 s]
  Range (min … max): 5.537 s … 5.735 s     10 runs
```

These were all built with `--release`, of course.
Great, so `async-seq` is 6.3x slower, but not 64x; that's reassuring! 🙂
Could you try increasing the nofile limit and running `async-par` again (e.g. `ulimit -S -n 4096` may help)?
Sure.
```
Benchmark #1: builds/async-par ~/src
  Time (mean ± σ):   4.462 s ± 1.566 s     [User: 5.288 s, System: 7.233 s]
  Range (min … max): 2.740 s … 7.184 s     10 runs
```
Thank you! `async-par` performs better, but not to the extent I had hoped. Both async versions are quite slow (good that it's not 60x, but 6x is still a considerable slowdown).
I'm tempted to set up native Linux on my PC over the weekend and run it on the same set of files on Windows, WSL2, and native Linux to have an apples-to-apples comparison.
My main opinion on issues like this one is that if someone submits a PR that improves the speed of filesystem operations, I am happy to add those improvements (#3518 is an example), but it is not a sufficiently large priority for me to spend time looking for fixes myself. People who need speedups for their fs ops can already get them now by moving the operation into a single `spawn_blocking` call.
@Darksonn that's fair. To be clear, I don't expect you to spend time diagnosing the issue and coming up with a fix; we all have different priorities, and that's fine.
The way I see it, these perf characteristics are surprising at the very least, so creating an issue is like putting a stick in the ground to say "we're aware of this", and then maybe:
- there will be an improvement PR: either someone stumbles upon this issue and comes up with an improvement, or I will dig deeper when I have spare time;
- or we will confirm that for this sort of workload the perf hit is just inherent and there's nothing fishy going on. In that case, the good outcome would probably be a section in the docs, so that new users at least have some awareness.

However, I can also see that these types of issues may be seen as not directly actionable, which is totally fair. If that is the case for the `tokio` project, I'd be fine with closing the issue.
In any case, I apologise for the inconvenience if I missed something about this in the guidelines.
Updated benchmarks: https://github.com/artempyanykh/rdu
- Same machine, same set of files.
- Ran on native Linux, WSL2, and native Windows.
- On Linux, with both warm and cold disk cache.

On Windows the perf profile is very different from Linux: the naive async version is ~2.2x slower, which is kind of acceptable. On native Linux with a warm disk cache the naive async version is 9x slower, and on WSL2 it's 55x slower.
This is referred to in the talk "Java and Rust" by Yishai Galatzer. They used Tokio async fs operations in a benchmark and compared them with Java NIO.
IMO, it unfairly pitches Rust as being too slow compared to Java, which is of course not really true.