
Unify connection and endpoint drivers

Ralith opened this issue 4 years ago • 30 comments

Fixes https://github.com/quinn-rs/quinn/issues/914. This is a high-risk change, so we probably don't want it in 0.8, but I'm getting a head start on it all the same.

Ralith avatar Oct 30 '21 05:10 Ralith

This could probably use some more iteration, but is now functionally complete. Next step is careful benchmarking.

Ralith avatar Nov 17 '21 02:11 Ralith

Is this still waiting for benchmarking?

djc avatar Dec 29 '21 06:12 djc

Initial results were inconclusive, with increases in ACK frames sent, presumed due to reduced internal buffering. I've therefore been waiting for @Matthias247 to finish his WIP delayed ACK support which should remove that variable.

Ralith avatar Dec 29 '21 06:12 Ralith

How much overhead results from the use of atomic operations and mutexes? I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations. Users would be expected to shard their program across cores using SO_REUSEPORT or similar.

DemiMarie avatar May 08 '22 04:05 DemiMarie
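
For reference, a minimal sketch of the sharding idea described above, assuming the socket2 crate with its "all" feature on a Unix target (none of this is quinn code): every worker thread binds its own UDP socket to the same address via SO_REUSEPORT, and the kernel's flow hash decides which socket receives a given datagram.

```rust
// Sketch only: per-core sharding with SO_REUSEPORT (socket2 crate, "all"
// feature, Unix). Each worker binds its own socket to the same address; the
// kernel's flow hash then picks which socket receives a given datagram.
use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, UdpSocket};

fn bind_shard(addr: SocketAddr) -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_reuse_port(true)?; // SO_REUSEPORT: several sockets share one port
    socket.bind(&addr.into())?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    let addr: SocketAddr = "0.0.0.0:4433".parse().unwrap();
    let workers = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let sock = bind_shard(addr).expect("bind shard");
            std::thread::spawn(move || {
                // A real shard would drive its own QUIC endpoint on this socket;
                // here we just drain datagrams to keep the sketch self-contained.
                let mut buf = [0u8; 1500];
                while let Ok((_len, _peer)) = sock.recv_from(&mut buf) {}
            })
        })
        .collect();
    for handle in handles {
        let _ = handle.join();
    }
    Ok(())
}
```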

RDMA verbs are thread-safe, but I wonder if their performance suffers less because they involve less reference counting.

DemiMarie avatar May 08 '22 04:05 DemiMarie

> How much overhead results from the use of atomic operations and mutexes?

Approximately none. CPU time is mostly spent in crypto, syscalls, and memory management, IIRC (regardless of this PR). If you want to profile and experiment yourself, that would be very welcome!

> I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations

Very early versions of quinn took this approach. It didn't seem to help.

> Users would be expected to shard their program across cores using SO_REUSEPORT or similar.

This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.

Ralith avatar May 08 '22 04:05 Ralith

> > Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
>
> This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.

My understanding is that load balancers can only send the packet to the correct device, and something else has to get the packet to the correct thread. This can be done using hardware or software RSS or a custom software eBPF program. One still needs to handle the occasional misrouted packet due to migration, but that is rare enough it can be a slow path.

DemiMarie avatar May 08 '22 05:05 DemiMarie

That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.

Ralith avatar May 08 '22 16:05 Ralith

> That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.

That assumes you have unlimited addresses, but I see your point. Also forwarding of misdirected packets needs to be addressed.

DemiMarie avatar May 10 '22 14:05 DemiMarie

> That assumes you have unlimited addresses

It assumes you have more addresses available than CPU cores. Recall that UDP ports alone trivially give you ~2^16 addresses.

> Also forwarding of misdirected packets needs to be addressed.

A load balancer routing packets deliberately does not have that problem.

Ralith avatar May 10 '22 16:05 Ralith

> > Also forwarding of misdirected packets needs to be addressed.
>
> A load balancer routing packets deliberately does not have that problem.

Depends on whether the load balancer can use connection IDs, or is only able to use the 5-tuple. Lots of switches are only capable of the latter, and hardware load balancers are much more expensive.

DemiMarie avatar May 10 '22 17:05 DemiMarie

I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.

But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might either route just by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.

Matthias247 avatar May 10 '22 17:05 Matthias247

> I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.
>
> But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might either route just by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.

Migration is sufficiently rare that one can explicitly send misrouted packets to the correct server.

DemiMarie avatar May 31 '22 18:05 DemiMarie
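
For context on the routing alternatives being discussed, here is a rough, hypothetical sketch (not code from this PR or from quinn) of connection-ID-based dispatch: the balancer extracts the destination connection ID from each datagram and hashes it to pick a backend, so packets keep reaching the same server even after the client's address changes. It assumes servers issue connection IDs of a fixed length (CID_LEN), since QUIC short headers carry no explicit length field.

```rust
// Hypothetical sketch of connection-ID routing (not quinn code): pick a
// backend by hashing the destination connection ID instead of the 5-tuple,
// so packets follow the connection across address changes.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Connection ID length the servers are assumed to use; short headers carry
/// no explicit length, so the balancer must know it out of band.
const CID_LEN: usize = 8;

/// Extract the destination connection ID from a QUIC v1 datagram, if possible.
fn dst_cid(datagram: &[u8]) -> Option<&[u8]> {
    let first = *datagram.first()?;
    if (first & 0x80) != 0 {
        // Long header: flags (1 byte) + version (4 bytes), then a length-prefixed DCID.
        let dcid_len = *datagram.get(5)? as usize;
        datagram.get(6..6 + dcid_len)
    } else {
        // Short header: flags (1 byte), then a DCID of implicit, pre-agreed length.
        datagram.get(1..1 + CID_LEN)
    }
}

/// Choose a backend index for this datagram; None if the packet can't be parsed.
fn route(datagram: &[u8], backends: usize) -> Option<usize> {
    if backends == 0 {
        return None;
    }
    let cid = dst_cid(datagram)?;
    let mut hasher = DefaultHasher::new();
    cid.hash(&mut hasher);
    Some((hasher.finish() % backends as u64) as usize)
}
```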

This has been rebased, but the DelayQueue it introduces depends on tokio::time::Sleep, so additional work is required to be runtime-independent.

Ralith avatar Aug 15 '22 17:08 Ralith
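
One common way to make such a queue runtime-agnostic, sketched below purely for illustration and not necessarily the approach this PR takes, is to have the queue only track deadlines and report the earliest one, leaving it to the driver to turn that Instant into whatever sleep primitive its runtime provides.

```rust
// Illustration only: a runtime-agnostic delay queue interface. The queue just
// tracks deadlines; whoever drives it asks for the earliest deadline and
// constructs its own sleep future (tokio, async-std, ...).
use std::time::Instant;

struct DelayQueue<T> {
    // A real implementation would use a timer wheel; a Vec keeps the sketch short.
    entries: Vec<(Instant, T)>,
}

impl<T> DelayQueue<T> {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    fn insert(&mut self, deadline: Instant, value: T) {
        self.entries.push((deadline, value));
    }

    /// Earliest pending deadline; the caller decides how to wait for it.
    fn next_deadline(&self) -> Option<Instant> {
        self.entries.iter().map(|(deadline, _)| *deadline).min()
    }

    /// Remove and return every value whose deadline has passed.
    fn expire(&mut self, now: Instant) -> Vec<T> {
        let mut expired = Vec::new();
        let mut i = 0;
        while i < self.entries.len() {
            if self.entries[i].0 <= now {
                expired.push(self.entries.swap_remove(i).1);
            } else {
                i += 1;
            }
        }
        expired
    }
}
```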

Would it be helpful to land some of the smaller refactoring commits already, to clarify this PR?

djc avatar Aug 30 '22 09:08 djc

Sure, I can split those out tonight.

Ralith avatar Aug 30 '22 16:08 Ralith

Hi, I gave your branch a try while doing some benchmarks of my own. I did not find any significant difference in the transfer speed of a 500 MiB file.

Code I used: https://github.com/WilliamVenner/quictransfer/blob/master/src/main.rs

WilliamVenner avatar Sep 03 '22 16:09 WilliamVenner

Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.

Ralith avatar Sep 17 '22 19:09 Ralith

> Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.

This is about efficiency, not about performance/throughput, right? There's no loss in that setup? I think the evaluation for efficiency should be made on the benchmarks in the quinn repository. Everything else just adds code on top of it, which makes it harder to understand what actually causes the degradation.

Matthias247 avatar Sep 18 '22 18:09 Matthias247

I understood that post to be reporting decreased throughput when CPU-bound: 2412 Mbps without this branch, and 1913.57 Mbps with it. Certainly a minimal standalone repro would be more actionable.

Ralith avatar Sep 18 '22 19:09 Ralith

Maybe it's because they use async-std in the benchmark, which is multithreaded, so crypto runs on a different thread than IO. Combined with them picking a cipher suite (ChaCha20) that performs an order of magnitude worse, for an unknown reason, this makes the now-combined IO/crypto thread do more work.

Matthias247 avatar Sep 19 '22 06:09 Matthias247

When running this with uploads from multiple clients, I get a panic in the delay queue:

RUST_BACKTRACE=1 cargo run --release -- -c 24 --download-size 0 --upload-size 100M --stats

    Finished release [optimized + debuginfo] target(s) in 1.55s
     Running `/mnt/c/Users/matth/Code/rust/quinn/target/release/bulk -c 24 --download-size 0 --upload-size 100M --stats`

thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:84:21
stack backtrace:
   0: std::panicking::begin_panic
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
   1: quinn::delay_queue::DelayQueue<T>::poll
   2: quinn::endpoint::EndpointInner::drive_timers
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:499:39
   3: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:307:23

Matthias247 avatar Sep 19 '22 07:09 Matthias247

Debug build panics at a different place:

thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:235:13
stack backtrace:
   0: std::panicking::begin_panic
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
   1: <slab::Slab<T> as core::ops::index::IndexMut<usize>>::index_mut
             at /home/matthias/.cargo/registry/src/github.com-1ecc6299db9ec823/slab-0.4.7/src/lib.rs:1192:18
   2: quinn::delay_queue::DelayQueue<T>::list_unlink
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:235:13
   3: quinn::delay_queue::DelayQueue<T>::unlink
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:224:9
   4: quinn::delay_queue::DelayQueue<T>::reset
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9
   5: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:331:29

and

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `Some(Timer(19))`,
 right: `None`: head of list has no predecessor', quinn/src/delay_queue.rs:218:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: quinn::delay_queue::DelayQueue<T>::unlink
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:218:13
   5: quinn::delay_queue::DelayQueue<T>::reset
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9

and

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `None`,
 right: `Some(Timer(9))`: successor links to head', quinn/src/delay_queue.rs:79:21
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: quinn::delay_queue::DelayQueue<T>::scan_bottom
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:79:21
   5: quinn::delay_queue::DelayQueue<T>::poll
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:61:34
   6: quinn::endpoint::EndpointInner::drive_timers

Matthias247 avatar Sep 19 '22 07:09 Matthias247

Due to the panics on all multi-connection tests, I only got benchmark results for a single connection. Looks like an improvement, however!

Main

Throughput: 580 MB/s
RTT: Client 180 μs, Server 170 μs
CWND Server: 1103736476 (extremely ridiculous)

Unify Drivers

Throughput: 610 MB/s
RTT: Client 240 μs, Server 440 μs
CWND Server: 51262370

Matthias247 avatar Sep 19 '22 07:09 Matthias247

So throughput improves 5%, but latency is 33%-160% worse?

djc avatar Sep 19 '22 09:09 djc

I wouldn't read too heavily into μs variations in latency. That's probably either noise or inconsequential constant variation.

Good catch on the multi-connection failures, will investigate.

Ralith avatar Sep 19 '22 19:09 Ralith

reset was corrupting the intrusive list. Fixed.

Ralith avatar Sep 19 '22 20:09 Ralith
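
For readers following along, here is a simplified, hypothetical sketch of the kind of bookkeeping such a bug involves (not the actual delay_queue.rs code): timers live in a slab, each bucket is a doubly-linked list threaded through prev/next keys, and resetting a timer must fully unlink it, patching both neighbours and the list head, before relinking it. Missing any of those updates leaves dangling keys and produces panics like the ones above.

```rust
// Simplified sketch (not the real delay_queue.rs): timers stored in a slab,
// with each bucket a doubly-linked list threaded through prev/next keys.
// Unlinking must patch the predecessor, the successor, and possibly the list
// head; missing any of these leaves dangling keys.
use slab::Slab;

#[derive(Default)]
struct List {
    head: Option<usize>,
}

struct Timer {
    prev: Option<usize>,
    next: Option<usize>,
}

fn unlink(timers: &mut Slab<Timer>, list: &mut List, key: usize) {
    // Detach our own links first.
    let (prev, next) = {
        let timer = &mut timers[key];
        (timer.prev.take(), timer.next.take())
    };
    match prev {
        // Interior node: the predecessor now points at our successor.
        Some(p) => timers[p].next = next,
        // We were the head: the list must now start at our successor.
        None => {
            debug_assert_eq!(list.head, Some(key), "node without predecessor must be the head");
            list.head = next;
        }
    }
    // The successor's back-pointer skips over us as well.
    if let Some(n) = next {
        timers[n].prev = prev;
    }
}
```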

> I wouldn't read too heavily into μs variations in latency.

Agreed. It seems to switch between runs in the 100-500 μs range, and that's all fine. The multi-millisecond delays that occur in other tests are the more concerning ones, since they are symptoms of excessive buffering.

I tried again now with multiple connections, and it seems to work fine. I did a 24-connection (1 connection per core) upload and download. Results are pretty much equivalent between main and unify-drivers, which are both around:

24 connections

Throughput: Each stream 13.37 MB/s, total 316 MB/s
RTT: Client 72 ms, Server 60 ms
CWND Server: 50651475

24 connections upload

Throughput: Each stream 10.10 MB/s, total 242 MB/s
RTT: Client 390 μs, Server 3.3 ms
CWND Client: 304000

Based on that, I think merging is fine. As a follow-up, it could be investigated how to avoid the large RTT under heavy server load (I assume it's due to all the connections producing packets for the endpoint, which can't keep up with sending them).

Matthias247 avatar Sep 20 '22 04:09 Matthias247

Note that the bulk benchmark uses a single thread for the server. If I use a multithreaded runtime for the server (but retain one thread per client) and use cores / 2 clients (so one core is available for each half of a connection) I see a ~15% speedup on main (presumably from parallel crypto), and a ~15% slowdown on this branch (presumably from inter-thread comms overhead and lock contention).

On the other hand, in a hypothetical configuration with a totally discrete server endpoint per core, we'd almost certainly see a linear speedup, on top of the baseline efficiency and consistency improvements of this branch. I think that's the case that makes sense to optimize for: it scales better for large-scale use, which must distribute load across machines anyway, and for very small-scale use like game servers, where most CPU is being spent elsewhere. But it's not a 100% clean win, especially for naive/mid-scale users who won't have load balancing as a matter of course.

Ralith avatar Sep 20 '22 19:09 Ralith
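
For illustration, a rough sketch of that "discrete endpoint per core" configuration, with a hypothetical build_server_endpoint helper standing in for the usual quinn server setup: each worker thread runs its own single-threaded Tokio runtime and its own endpoint bound to a distinct port, so nothing is shared between cores.

```rust
// Sketch of the "one endpoint per core" model (build_server_endpoint is a
// hypothetical helper standing in for the usual quinn server setup): each
// worker thread owns a single-threaded Tokio runtime and an endpoint bound
// to its own port, so no state is shared between cores.
use std::net::SocketAddr;

fn main() {
    let base_port: u16 = 4433;
    let workers = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    let handles: Vec<_> = (0..workers as u16)
        .map(|i| {
            let addr: SocketAddr = format!("0.0.0.0:{}", base_port + i).parse().unwrap();
            std::thread::spawn(move || {
                // One current-thread runtime per core; nothing here needs Send.
                let rt = tokio::runtime::Builder::new_current_thread()
                    .enable_all()
                    .build()
                    .unwrap();
                rt.block_on(async move {
                    // let endpoint = build_server_endpoint(addr)?; // hypothetical
                    // while let Some(incoming) = endpoint.accept().await { ... }
                    let _ = addr;
                });
            })
        })
        .collect();

    for handle in handles {
        let _ = handle.join();
    }
}
```

A front-end load balancer (or DNS spreading clients across the per-core addresses) would then distribute connections, which is the deployment shape the preceding comment argues is worth optimizing for.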

If you discount delay_queue.rs (which I think is mostly fair, since it's strongly encapsulated), this is net -119 LoC.

Ralith avatar Sep 21 '22 00:09 Ralith