quinn
Unify connection and endpoint drivers
Fixes https://github.com/quinn-rs/quinn/issues/914. This is a high-risk change, so we probably don't want it in 0.8, but I'm getting a head start on it all the same.
This could probably use some more iteration, but is now functionally complete. Next step is careful benchmarking.
Is this still waiting for benchmarking?
Initial results were inconclusive, with increases in ACK frames sent, presumed due to reduced internal buffering. I've therefore been waiting for @Matthias247 to finish his WIP delayed ACK support which should remove that variable.
How much overhead results from the use of atomic operations and mutexes? I wonder if a better approach would be to make everything !Send and !Sync and use non-atomic operations. Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
RDMA verbs are thread-safe, but I wonder if their performance suffers less because they do less reference counting.
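For concreteness, the sharding being proposed would look roughly like this; a minimal sketch assuming the socket2 crate for SO_REUSEPORT, with the per-thread endpoint handling stubbed out:

```rust
// Minimal sketch of the proposed SO_REUSEPORT sharding (assumes the `socket2`
// crate). Each worker thread binds its own socket to the same port; the kernel
// then spreads incoming datagrams across the sockets.
use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, UdpSocket};

fn bind_shard(addr: SocketAddr) -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_reuse_port(true)?; // SO_REUSEPORT (Unix; needs socket2's "all" feature)
    socket.bind(&addr.into())?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    let addr: SocketAddr = "0.0.0.0:4433".parse().unwrap();
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let workers = (0..cores)
        .map(|_| {
            let socket = bind_shard(addr)?;
            Ok(std::thread::spawn(move || {
                // A !Send, non-atomic endpoint would live entirely on this thread.
                let mut buf = [0u8; 1500];
                while let Ok((len, peer)) = socket.recv_from(&mut buf) {
                    let _ = (len, peer); // hand the datagram to this core's endpoint
                }
            }))
        })
        .collect::<std::io::Result<Vec<_>>>()?;
    for worker in workers {
        let _ = worker.join();
    }
    Ok(())
}
```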
How much overhead results from the use of atomic operations and mutexes
Approximately none. CPU time is mostly spent in crypto, syscalls, and memory management, IIRC (regardless of this PR). If you want to profile and experiment yourself, that would be very welcome!
I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations
Very early versions of quinn took this approach. It didn't seem to help.
Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.
Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.
My understanding is that load balancers can only send the packet to the correct device, and something else has to get the packet to the correct thread. This can be done using hardware or software RSS or a custom software eBPF program. One still needs to handle the occasional misrouted packet due to migration, but that is rare enough it can be a slow path.
That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.
That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.
That assumes you have unlimited addresses, but I see your point. Also forwarding of misdirected packets needs to be addressed.
That assumes you have unlimited addresses
It assumes you have more addresses available than CPU cores. Recall that UDP ports alone trivially give you ~2^16 addresses.
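To make that concrete, here is a rough sketch of giving each worker its own endpoint on its own port, assuming a recent quinn API (Endpoint::server and accept) with one current-thread tokio runtime per worker; PORT_BASE and the server config handling are illustrative:

```rust
// Sketch: one independent server endpoint per core, each with its own UDP port,
// so a load balancer can address each worker directly. Assumes a recent quinn
// API (`Endpoint::server`, `accept`); the server config construction is elided.
use std::net::SocketAddr;

const PORT_BASE: u16 = 4433; // hypothetical

fn spawn_workers(server_config: quinn::ServerConfig, cores: u16) {
    for i in 0..cores {
        let config = server_config.clone();
        std::thread::spawn(move || {
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .unwrap();
            rt.block_on(async move {
                let addr: SocketAddr = ([0, 0, 0, 0], PORT_BASE + i).into();
                let endpoint = quinn::Endpoint::server(config, addr).unwrap();
                while let Some(connecting) = endpoint.accept().await {
                    tokio::spawn(async move {
                        if let Ok(_connection) = connecting.await {
                            // handle this connection's streams here
                        }
                    });
                }
            });
        });
    }
}
```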
Also forwarding of misdirected packets needs to be addressed.
A load balancer routing packets deliberately does not have that problem.
Also forwarding of misdirected packets needs to be addressed.
A load balancer routing packets deliberately does not have that problem.
Depends on whether the load balancer can use connection IDs, or is only able to use the 5-tuple. Lots of switches are only capable of the latter, and hardware load balancers are much more expensive.
I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.
But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might route either just by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.
I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.
But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might route either just by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.
Migration is sufficiently rare that one can explicitly send misrouted packets to the correct server.
This has been rebased, but the DelayQueue it introduces depends on tokio::time::Sleep, so additional work is required to make it runtime-independent.
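For illustration, runtime independence would need something like a timer trait that each runtime implements; a hypothetical sketch (not quinn's actual API), shown with a tokio-backed implementation:

```rust
// Hypothetical sketch of a runtime-agnostic timer hook (not quinn's actual API):
// the DelayQueue would be driven through this trait instead of depending on
// tokio::time::Sleep directly.
use std::future::Future;
use std::pin::Pin;
use std::time::Instant;

pub trait AsyncTimer: Send + Sync {
    /// Returns a future that resolves at or after `deadline`.
    fn sleep_until(&self, deadline: Instant) -> Pin<Box<dyn Future<Output = ()> + Send>>;
}

pub struct TokioTimer;

impl AsyncTimer for TokioTimer {
    fn sleep_until(&self, deadline: Instant) -> Pin<Box<dyn Future<Output = ()> + Send>> {
        Box::pin(tokio::time::sleep_until(tokio::time::Instant::from_std(deadline)))
    }
}
```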
Would it be helpful to land some of the smaller refactoring commits already, to clarify this PR?
Sure, I can split those out tonight.
Hi, I gave your branch a try while doing some benchmarks of my own. I did not find any significant difference in the transfer speed of a 500 MiB file.
Code I used: https://github.com/WilliamVenner/quictransfer/blob/master/src/main.rs
Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.
Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.
This is about efficiency, not about performance/throughput - right? There's no loss in that setup? I think the evaluation for efficiency should be made on the benchmarks in the quinn repository. Everything else just adds code on top of it which makes it harder to understand what actually causes the degradation.
I understood that post to be reporting decreased throughput when CPU-bound: 2412 Mbps without this branch, and 1913.57 Mbps with it. Certainly a minimal standalone repro would be more actionable.
Maybe it's because they use async-std in the benchmark, which is multithreaded, so crypto runs in a different thread than IO. Combined with them picking a cipher suite that performs an order of magnitude worse (ChaCha20) for an unknown reason, the now-combined IO/crypto thread has more work to do.
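If cipher suite choice is the culprit, restricting the client to the AES-GCM suites is straightforward; a sketch assuming a rustls 0.20/0.21-style config builder and quinn's rustls-backed ClientConfig::new (adapt to the versions actually in use):

```rust
// Sketch: prefer the AES-GCM suites over ChaCha20-Poly1305 on AES-NI hardware.
// Assumes a rustls 0.20/0.21-style builder; newer rustls versions configure
// suites through a CryptoProvider instead.
use std::sync::Arc;

fn aes_only_client(roots: rustls::RootCertStore) -> Result<quinn::ClientConfig, rustls::Error> {
    let tls = rustls::ClientConfig::builder()
        .with_cipher_suites(&[
            rustls::cipher_suite::TLS13_AES_128_GCM_SHA256,
            rustls::cipher_suite::TLS13_AES_256_GCM_SHA384,
        ])
        .with_safe_default_kx_groups()
        .with_protocol_versions(&[&rustls::version::TLS13])?
        .with_root_certificates(roots)
        .with_no_client_auth();
    Ok(quinn::ClientConfig::new(Arc::new(tls)))
}
```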
When running this with uploads from multiple clients, I get a panic in the delay queue:
RUST_BACKTRACE=1 cargo run --release -- -c 24 --download-size 0 --upload-size 100M --stats
Finished release [optimized + debuginfo] target(s) in 1.55s
Running `/mnt/c/Users/matth/Code/rust/quinn/target/release/bulk -c 24 --download-size 0 --upload-size 100M --stats`
thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:84:21
stack backtrace:
0: std::panicking::begin_panic
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
1: quinn::delay_queue::DelayQueue<T>::poll
2: quinn::endpoint::EndpointInner::drive_timers
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:499:39
3: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:307:23
Debug build panics at a different place:
thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:235:13
stack backtrace:
0: std::panicking::begin_panic
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
1: <slab::Slab<T> as core::ops::index::IndexMut<usize>>::index_mut
at /home/matthias/.cargo/registry/src/github.com-1ecc6299db9ec823/slab-0.4.7/src/lib.rs:1192:18
2: quinn::delay_queue::DelayQueue<T>::list_unlink
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:235:13
3: quinn::delay_queue::DelayQueue<T>::unlink
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:224:9
4: quinn::delay_queue::DelayQueue<T>::reset
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9
5: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:331:29
and
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `Some(Timer(19))`,
right: `None`: head of list has no predecessor', quinn/src/delay_queue.rs:218:13
stack backtrace:
0: rust_begin_unwind
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
4: quinn::delay_queue::DelayQueue<T>::unlink
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:218:13
5: quinn::delay_queue::DelayQueue<T>::reset
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9
and
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `None`,
right: `Some(Timer(9))`: successor links to head', quinn/src/delay_queue.rs:79:21
stack backtrace:
0: rust_begin_unwind
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
4: quinn::delay_queue::DelayQueue<T>::scan_bottom
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:79:21
5: quinn::delay_queue::DelayQueue<T>::poll
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:61:34
6: quinn::endpoint::EndpointInner::drive_timers
Due to the panics on all multi-connection tests, I only got benchmark results for a single connection. Looks like an improvement, however!
Main
Throughput: 580 MB/s RTT: Client 180 us, Server 170 us CWND Server: 1103736476 (extremely ridiculous)
Unify Drivers
Throughput: 610 MB/s RTT: Client 240 us, Server 440 us CWND Server: 51262370
So throughput improves 5%, but latency is 33%-160% worse?
I wouldn't read too heavily into μs variations in latency. That's probably either noise or inconsequential constant variation.
Good catch on the multi-connection failures, will investigate.
reset was corrupting the intrusive list. Fixed.
I wouldn't read too heavily into μs variations in latency.
Agreed. It seems to vary between runs in the 100-500 µs range, and that's all fine. The multi-millisecond delays that occur in other tests are more concerning, since they are symptoms of excessive buffering.
I tried again now with multiple connections, and it seems to work fine. I did a 24-connection (1 connection per core) upload and download. Results are pretty much equivalent between main and unify-drivers, which are both around:
24 connections
Throughput: Each Stream 13.37 MB/s. Total: 316 MB/s RTT: Client 72 ms, Server 60 ms CWND Server: 50651475
24 connections upload
Throughput: Each Stream 10.10 MB/s. Total: 242 MB/s RTT: Client 390 us, Server 3.3 ms CWND Client: 304000
Based on that, I think merging is fine. As a follow-up, we could investigate how to avoid the large RTT under heavy server load (I assume it's because all the connections produce packets for the endpoint, which can't keep up with sending them).
Note that the bulk benchmark uses a single thread for the server. If I use a multithreaded runtime for the server (but retain one thread per client) and use cores / 2 clients (so one core is available for each half of a connection) I see a ~15% speedup on main (presumably from parallel crypto), and a ~15% slowdown on this branch (presumably from inter-thread comms overhead and lock contention).
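For reference, the two server runtime shapes compared above look roughly like this (a sketch using tokio's runtime builder; the worker count is illustrative):

```rust
// Sketch of the two server runtime shapes compared above (tokio builder API).
// The default bulk benchmark drives the server on a single thread; the
// multithreaded variant lets crypto and IO run on different worker threads.
fn single_threaded() -> std::io::Result<tokio::runtime::Runtime> {
    tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
}

fn multi_threaded(workers: usize) -> std::io::Result<tokio::runtime::Runtime> {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers) // e.g. cores / 2, as in the experiment above
        .enable_all()
        .build()
}
```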
On the other hand, in a hypothetical configuration with a totally discrete server endpoint per core, we'd almost certainly see a linear speedup, on top of the baseline efficiency and consistency improvements on this branch. I think that's the case that makes sense to optimize for: it scales better for large-scale use, which must distribute load across machines anyway, and for very small-scale use like game servers, where most CPU is spent elsewhere. But it's not a 100% clean win, especially for naive/mid-scale users that won't have load balancing as a matter of course.
If you discount delay_queue.rs (which I think is mostly fair, since it's strongly encapsulated), this is net -119 LoC.