Unexpected performance slowdown when using `tokio::spawn()` to parallelize stream operations
Bringing this over from the Gitter chat as its own issue:
I've identified a performance slowdown when attempting to use tokio::spawn() to parallelize a large number of stream back-and-forths on the same connection. I have a reproducer here: https://github.com/SriRamanujam/quinn_minimal_network_repro/tree/dc55069550a87063f303b9a773977767e8d61218 . There are two tunables: one controls whether the client uses spawn() or .await, and another changes the number of tasks, here: https://github.com/SriRamanujam/quinn_minimal_network_repro/blob/dc55069550a87063f303b9a773977767e8d61218/src/client/main.rs#L149-L152
In it, I open a connection and then use that connection to perform the following sequence of actions a bunch of times (the two client loop shapes are sketched below the list):
- open a stream on that connection
- poke the server to ask it to send data
- wait for the server to send back a 1 MiB payload. The payload is read off of disk server-side (it's the same 1 MiB file each time; I just needed something somewhat large to send back)
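To make the two modes concrete, this is roughly the shape of the client loop. `do_request` here is a hypothetical stand-in for the real open-stream/send/read sequence (the actual code is in the reproducer linked above); the sleep is just a placeholder so the sketch compiles on its own:

```rust
use std::time::Duration;

// Hypothetical stand-in for the real per-request work: open a bidirectional
// stream on the shared connection, send the request, read the 1 MiB response.
async fn do_request(num: u32) {
    tokio::time::sleep(Duration::from_millis(1)).await;
    println!("request {num} done");
}

#[tokio::main]
async fn main() {
    const NUM_TASKS: u32 = 512;

    // Variant 1: sequential -- each request is .awaited before the next starts.
    for num in 0..NUM_TASKS {
        do_request(num).await;
    }

    // Variant 2: concurrent -- every request is spawned up front, so all 512
    // tasks share the one connection at the same time.
    let handles: Vec<_> = (0..NUM_TASKS)
        .map(|num| tokio::spawn(do_request(num)))
        .collect();
    for handle in handles {
        handle.await.expect("task panicked");
    }
}
```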
Without using tokio::spawn() (in other words, using .await in a loop instead of spawning in a loop), I see total RTTs on the client of a few milliseconds. Using tokio::spawn(), however, the RTT jumps to several seconds on my laptop. All of the extra time is spent waiting for data to make its way through the connection, either when initially sending data to the server or when receiving data on the client. Here is a representative example of what I see (this is all over localhost):
```
❯ grep -hF '{num=230}' client.log server.log | sort --stable -k 1,3
# Using tokio::spawn()
Oct 10 20:28:40.847 INFO Request{num=230}:Making request and waiting for response: client: Sending request...
Oct 10 20:28:40.909 INFO Request{num=230}:Making request and waiting for response: client: Reading response...
Oct 10 20:28:40.980 INFO Response{num=230}: server: Received 3 bytes from stream: 230
Oct 10 20:28:41.122 INFO Response{num=230}: server: Read complete! Writing message to send stream...
Oct 10 20:28:46.093 INFO Request{num=230}:Making request and waiting for response: client: Response fully read!
Oct 10 20:28:46.110 INFO Response{num=230}: server: response handled!
# Using .await
Oct 10 20:29:33.285 INFO Request{num=230}:Making request and waiting for response: client: Sending request...
Oct 10 20:29:33.286 INFO Request{num=230}:Making request and waiting for response: client: Reading response...
Oct 10 20:29:33.286 INFO Response{num=230}: server: Received 3 bytes from stream: 230
Oct 10 20:29:33.286 INFO Response{num=230}: server: Read complete! Writing message to send stream...
Oct 10 20:29:33.297 INFO Request{num=230}:Making request and waiting for response: client: Response fully read!
Oct 10 20:29:33.297 INFO Response{num=230}: server: response handled!
```
Note that when using tokio::spawn(), sending 3 bytes from the client to the server took ~80 ms, and receiving 1 MiB of data took 5 seconds. When using .await, it took less than a millisecond to send 3 bytes, and 11 ms to receive 1 MiB. I would like to stress that the only difference between the two runs was that in one case I spawned 512 tasks on the client, one for each back-and-forth, and in the other I .awaited each task individually. In both runs the server code was absolutely identical.
When you use tokio::spawn(), the requests all run in parallel and share the available network bandwidth. Assuming a fixed maximum bandwidth on the connection and total fairness between requests, it is expected that each transmission would be slower by a factor of num_reqs. But obviously the exact numbers also depend on the implementation, its fairness, and the maximum number of concurrent streams the client and server allow.
Is your slowdown purely observed through that? The per-request duration would be higher, but the overall program runtime should be similar (or even faster).
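To illustrate the fair-sharing argument with assumed numbers (not measurements from the reproducer): if the connection sustains some fixed throughput and 512 streams of 1 MiB each share it evenly, each individual response takes roughly 512x longer to arrive, even though the aggregate finishes in about the same time:

```rust
// Back-of-the-envelope fair-sharing math; all numbers are assumptions,
// not measurements from the reproducer.
fn main() {
    let link_bytes_per_sec = 1e9; // assume ~1 GB/s of usable throughput
    let payload = 1_048_576.0;    // 1 MiB response per request
    let num_reqs = 512.0;

    let alone = payload / link_bytes_per_sec;               // one stream by itself
    let shared = payload / (link_bytes_per_sec / num_reqs); // fair share of the link

    // Per-request latency grows by ~num_reqs; total transfer time is unchanged.
    println!("alone: {alone:.4} s, shared with {num_reqs} streams: {shared:.2} s");
}
```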
Further testing does lend some credence to your interpretation, @Matthias247. There are other things in play, like quinn not saturating the total bandwidth of the link, that I don't think have anything to do with this issue in particular. Even assuming total bandwidth saturation, though, this behavior still seems odd to me. Take the classic example of trying to optimize a website so that it has a low time to first paint. If, by chance, the actual HTML of the webpage was the last asset to fully download, then the time to first paint would be greater than if I had just opened a bunch of parallel connections a la HTTP/1 and pulled everything down myself.
Will clients now have to be smart about how many streams they open at once and what assets they prioritize over those streams? With this implementation, it would seem so.
quinn not saturating the total bandwidth of the link
What's your CPU usage like? quinn may be CPU-bound if the link isn't saturated, in which case investing some effort in the remaining low-hanging optimization fruit may be worthwhile.
Will clients now have to be smart about how many streams they open at once and what assets they prioritize over those streams?
If the application asks for concurrent streams, then concurrent streams are what it gets. There's no way quinn can tell which data is more important to the application, short of #165 being implemented.
then the time to first paint would be greater than if I had just opened a bunch of parallel connections a la HTTP/1 and pulled everything down myself.
That depends on where the bottleneck is. If it's actually a network congestion problem, then opening multiple streams won't make anything faster. It might help if there is a bottleneck in the implementation, or if, e.g., a single stream has too little buffer space. To be fair, however: there is a decent chance such a bottleneck exists, since neither QUIC nor HTTP/2 stacks have seen as much real-world experience and optimization as plain TCP stacks.
With QUIC there is also a decent chance of CPU resource exhaustion, due to less hardware acceleration being available compared to TCP. If that limit has been reached, it also won't help to open up more streams (unless those can run on separate cores with their own QUIC endpoints).
Will clients now have to be smart about how many streams they open at once and what assets they prioritize over those streams? With this implementation, it would seem so.
They already have to do this! Clients have to be careful about how much they request at once, and which data. To help with that, HTTP/2 also introduced stream prioritization, whereby clients could indicate to the server which of the concurrent downloads was the most important. Since that mechanism had its own set of issues, HTTP/3 initially removed it. However, there are now RFCs to add it back at a higher layer.
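For what it's worth, one simple way a client can bound how much it has in flight is a counting semaphore around the request loop. A minimal sketch using tokio::sync::Semaphore (`do_request` is again a hypothetical stand-in for the per-request work, and the limit of 16 is arbitrary):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical stand-in for the open-stream/send/read work per request.
async fn do_request(_num: u32) { /* ... */ }

#[tokio::main]
async fn main() {
    // Arbitrary cap: at most 16 requests in flight at once.
    let permits = Arc::new(Semaphore::new(16));
    let mut handles = Vec::new();
    for num in 0..512u32 {
        let permits = permits.clone();
        handles.push(tokio::spawn(async move {
            // Wait for a free slot before opening another stream.
            let _permit = permits.acquire().await.expect("semaphore closed");
            do_request(num).await;
            // _permit drops here, freeing the slot for the next task.
        }));
    }
    for handle in handles {
        handle.await.expect("task panicked");
    }
}
```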
There's precedent for QUIC reaching and even exceeding TCP speeds when sufficiently well tuned: https://www.fastly.com/blog/measuring-quic-vs-tcp-computational-efficiency. That article discusses a number of tactics we can adopt in quinn to improve performance significantly.