
Performance issue on internet

Open stormshield-damiend opened this issue 5 months ago • 10 comments

This is similar to what was discussed in #2153.

Here are some tests we ran on the following setup:

  • Two VMs in two different datacenters in the cloud
  • Latency between them is around 10ms (ping)
  • Network driver is virtio_net with generic-segmentation-offload: on, generic-receive-offload: on and tx-udp-segmentation: off [fixed] (from ethtool)
  • We test uploading from one VM to the other

First, here is the result of an iperf3 test between the two VMs (TCP, unencrypted):

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.12 GBytes   958 Mbits/sec  1102             sender
[  5]   0.00-10.01  sec  1.11 GBytes   955 Mbits/sec                  receiver

This matches the expected line rate between the two VMs.

We then used secnetperf from the msquic project (UDP with encryption): https://github.com/microsoft/msquic/tree/main/src/perf

Result: Upload 1298923520 bytes @ 913880 kbps (11370.619 ms).

We are close to the line rate.

We then used quinn/perf client and server (UDP with encryption)

Overall stats:
RPS: 37.25 (373 requests in 10.01s)

Stream metrics:

      │ Upload Duration │ Download Duration │ FBL        │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
 AVG  │         26.71ms │           57.00µs │    57.00µs │       323.99 Mb/s │          0.00 Mb/s
 P0   │         18.21ms │            1.00µs │     1.00µs │        99.09 Mb/s │          0.00 Mb/s
 P10  │         22.98ms │            2.00µs │     2.00µs │       287.31 Mb/s │          0.00 Mb/s
 P50  │         25.30ms │           47.00µs │    47.00µs │       331.87 Mb/s │          0.00 Mb/s
 P90  │         29.20ms │          124.00µs │   124.00µs │       365.43 Mb/s │          0.00 Mb/s
 P100 │         84.67ms │          398.00µs │   398.00µs │       460.59 Mb/s │          0.00 Mb/s

We are around half the line rate.

If we disable crypto using --no-protection:

Overall stats:
RPS: 38.84 (389 requests in 10.02s)

Stream metrics:

      │ Upload Duration │ Download Duration │ FBL        │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
 AVG  │         25.58ms │           76.00µs │    76.00µs │       337.91 Mb/s │          0.00 Mb/s
 P0   │         15.38ms │            1.00µs │     1.00µs │       100.66 Mb/s │          0.00 Mb/s
 P10  │         21.22ms │            5.00µs │     4.00µs │       283.64 Mb/s │          0.00 Mb/s
 P50  │         24.70ms │           70.00µs │    70.00µs │       339.74 Mb/s │          0.00 Mb/s
 P90  │         29.60ms │          143.00µs │   143.00µs │       395.84 Mb/s │          0.00 Mb/s
 P100 │         83.33ms │          578.00µs │   578.00µs │       545.26 Mb/s │          0.00 Mb/s

This is a bit better, but we are still far from the line rate.

I can dedicate some work time to this issue; if you have any insight, feel free to comment.

stormshield-damiend avatar Jun 05 '25 06:06 stormshield-damiend

Secnetperf socket parameters (from ss -um): skmem:(r0,rb425984,t0,tb212992,f4096,w0,o0,bl0,d0)
Quinn perf socket parameters (from ss -lnpum): skmem:(r0,rb425984,t0,tb425984,f4096,w0,o0,bl0,d0)

So this does not seem related to socket recv / send buffer size.

stormshield-damiend avatar Jun 05 '25 08:06 stormshield-damiend

Thanks for digging into performance like this!

What packet loss are you observing on this link? What congestion controller are you using? Are you CPU bound or I/O bound? Could allocations/buffering be a relevant constraint here?

djc avatar Jun 05 '25 08:06 djc

No problem.

I will measure packet loss or print what quinn / secnetperf sees.

The congestion controller is quinn's default, so Cubic; msquic/secnetperf supports both (Cubic is the default). I will do a measurement with cc=bbr.

CPU usage during the test:

  • quinn: client around 15% / server around 35%
  • secnetperf: client around 25% / server around 55%

I can try with other allocators (tikv-jemalloc / mimalloc / ...) and look at which allocator msquic/secnetperf uses by default.

stormshield-damiend avatar Jun 05 '25 09:06 stormshield-damiend

msquic/secnetperf with cc=bbr: Result: Upload 1760428032 bytes @ 1365141 kbps (10316.459 ms). So it is faster than the default, which is cc=cubic.

stormshield-damiend avatar Jun 05 '25 09:06 stormshield-damiend

Connection stats for quinn/perf (upload)

  • Client side
Overall stats:
RPS: 34.96 (350 requests in 10.01s)

Stream metrics:

      │ Upload Duration │ Download Duration │ FBL        │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
 AVG  │         28.44ms │           69.00µs │    69.00µs │       309.47 Mb/s │          0.00 Mb/s
 P0   │         17.28ms │            1.00µs │     1.00µs │       100.01 Mb/s │          0.00 Mb/s
 P10  │         21.62ms │            2.00µs │     2.00µs │       231.21 Mb/s │          0.00 Mb/s
 P50  │         26.90ms │           69.00µs │    69.00µs │       311.95 Mb/s │          0.00 Mb/s
 P90  │         36.00ms │          140.00µs │   140.00µs │       388.24 Mb/s │          0.00 Mb/s
 P100 │         83.90ms │          320.00µs │   320.00µs │       485.49 Mb/s │          0.00 Mb/s

ConnectionStats { 
udp_tx: UdpStats { datagrams: 260483, bytes: 376892168, ios: 28741 },

udp_rx: UdpStats { datagrams: 9150, bytes: 376861, ios: 9151 },

frame_tx: FrameStats { ACK: 1055, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 259755 },

frame_rx: FrameStats { ACK: 7949, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 1680, MAX_STREAMS_BIDI: 26, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 350 }, 

path: PathStats { rtt: 10.027276ms, cwnd: 687939, congestion_events: 4, lost_packets: 24, lost_bytes: 34848, sent_packets: 260484, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 }
 }
  • Server Side
ConnectionStats {
udp_tx: UdpStats { datagrams: 9303, bytes: 383096, ios: 9303 },

udp_rx: UdpStats { datagrams: 265194, bytes: 383699307, ios: 265195 },

frame_tx: FrameStats { ACK: 8079, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 1710, MAX_STREAMS_BIDI: 27, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 357 },

frame_rx: FrameStats { ACK: 1075, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 1, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 264447 },

path: PathStats { rtt: 9.62579ms, cwnd: 12000, congestion_events: 0, lost_packets: 0, lost_bytes: 0, sent_packets: 9303, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 }
 }

From client to server: 4 congestion events, 24 lost packets, 34848 bytes lost.

stormshield-damiend avatar Jun 05 '25 09:06 stormshield-damiend

It looks like you're comparing connection-level stats in iperf3/secnetperf to stream-level stats in Quinn. You also don't seem to be using the default perf configuration. What is the exact command line you're passing to the perf client? If you're using any stream concurrency, then the per-stream throughput will be much lower than the connection-level throughput, because multiple streams are running concurrently. If you aren't using any stream concurrency, then the pipeline is stalling as each "request" waits for a response from the server before the next is issued (see perf client behavior). Judging by sent_packets * current_mtu, the latter seems most likely. At 350 requests * 10ms round trip (minimum), that's accounting for more than a third of your total runtime.

If you want to run a simple "blast bytes down the stream as fast as possible" benchmark, you should set a very large upload and/or download size so you don't spend significant time waiting for responses or confuse matters with concurrent streams.
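For intuition, here is a small back-of-the-envelope check (plain Rust, figures taken from the stats quoted above) of how much runtime the request/response pattern alone can consume; this is only a lower bound, not a measurement:

```rust
fn main() {
    // Figures quoted above: 350 requests in a ~10 s run, with a ~10 ms
    // minimum round trip per request/response exchange.
    let requests = 350.0_f64;
    let rtt_secs = 0.010_f64;
    let runtime_secs = 10.0_f64;

    // Lower bound on time the pipeline spends stalled waiting for responses.
    let stalled_secs = requests * rtt_secs; // 3.5 s
    let fraction = stalled_secs / runtime_secs;

    assert!(fraction > 1.0 / 3.0); // more than a third of the total runtime
    println!(
        "at least {:.0}% of the run spent waiting on round trips",
        fraction * 100.0
    );
}
```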

Ralith avatar Jun 06 '25 00:06 Ralith

Hi, when setting a high value for upload or download in the perf client (too high to be reached within the test duration), no stats are displayed. We made a quick patch using a CancellationToken so that all futures stop correctly, but it is not finished yet.

Anyway, here is the upload performance for a 10 s upload using a single stream:

Here are the arguments given to perf_client: --duration 10 --download-size 0 --upload-size 1048576000 --conn-stats

Client

ConnectionStats {
udp_tx: UdpStats { datagrams: 292333, bytes: 424455184, ios: 35585 },

udp_rx: UdpStats { datagrams: 15306, bytes: 632183, ios: 15307 },

frame_tx: FrameStats { ACK: 1469, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 292327 },

frame_rx: FrameStats { ACK: 14252, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 2267, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 0 },

path: PathStats { rtt: 9.547958ms, cwnd: 355389, congestion_events: 11, lost_packets: 490, lost_bytes: 711480, sent_packets: 292334, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 } 
}

2025-06-10T15:28:37.184412Z  INFO perf_client: shutting down (timeout)
2025-06-10T15:28:37.185629Z  WARN perf_client: request: leaving
2025-06-10T15:28:37.217222Z  INFO perf_client: stream client bidirectional stream 0 uploaded 421527552 (bytes) elapsed 10.233532888 (secs) throughput 329.53 Mb/s

Here are the corresponding stats for the server side:

ConnectionStats {
udp_tx: UdpStats { datagrams: 15708, bytes: 648697, ios: 15708 },

udp_rx: UdpStats { datagrams: 297263, bytes: 431609556, ios: 297264 },

frame_tx: FrameStats { ACK: 14637, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 2312, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 1 },

frame_rx: FrameStats { ACK: 1505, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 297256 },

path: PathStats { rtt: 9.824697ms, cwnd: 12000, congestion_events: 0, lost_packets: 0, lost_bytes: 0, sent_packets: 15709, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 } }

The next step is to look at the TransportConfig and compare it to secnetperf's (I already spotted that they use a far more aggressive ack delay value than quinn's default).

stormshield-damiend avatar Jun 10 '25 15:06 stormshield-damiend

Thanks for the update! I'm surprised that getting rid of the pipeline bubbles had so little effect. Maybe the original per-stream statistics were already excluding that well enough.

Another area to compare between implementations is congestion controller configuration. A larger initial congestion window, for example, can dramatically improve initial performance on high-bandwidth links.

Given the low quinn CPU usage you report, especially compared to secnetperf's, it seems clear that the bottleneck is quinn choosing not to send more data, which suggests an issue in pacing, congestion control, or flow control. In particular, our Cubic implementation is fairly naive (e.g. we haven't implemented HyStart++), so there may be low hanging fruit there. Computing the proportion of time spent working vs. waiting for pacing vs. waiting for congestion control budget vs. waiting for flow control budget might also help.
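For context on why the congestion window matters so much here, the bandwidth-delay product sets how much data must be kept in flight to fill the path. A small self-contained check, using figures from the setup and the client PathStats quoted above:

```rust
fn main() {
    // Bandwidth-delay product: bytes that must stay in flight to fill the pipe.
    let bdp_bytes = |bandwidth_bps: f64, rtt_secs: f64| bandwidth_bps * rtt_secs / 8.0;

    // The setup above: ~1 Gbit/s line rate at ~10 ms RTT.
    let bdp = bdp_bytes(1e9, 0.010);
    assert_eq!(bdp as u64, 1_250_000); // ~1.25 MB in flight to reach line rate

    // The client PathStats above reported cwnd = 687939 bytes, i.e. roughly
    // half the BDP, which lines up with the observed ~half line rate.
    assert!(687_939.0 / bdp < 0.6);
}
```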

Ralith avatar Jun 10 '25 21:06 Ralith

If you want to take a look at msquic setting, they are all documented here https://github.com/microsoft/msquic/blob/main/docs/api/QUIC_SETTINGS.md

stormshield-damiend avatar Jun 11 '25 07:06 stormshield-damiend

As discussed on Discord, here is a plot of the evolution of the congestion window during a 600 s upload test (I just used --conn-stats to extract and plot the values).

Image

This does not look good.

Throughput was 349.77 Mb/s.

stormshield-damiend avatar Jun 12 '25 11:06 stormshield-damiend

By the way, this confirms that quinn perf is blocked by the congestion controller:

2025-06-24T07:22:51.816788Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.817842Z TRACE drive{id=0}:send{space=Data pn=15567}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.817857Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.818487Z TRACE drive{id=0}:send{space=Data pn=15574}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.818516Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control

stormshield-damiend avatar Jun 24 '25 07:06 stormshield-damiend

Well, that confirms it was blocked at two specific points in time. We really need to chart/aggregate state over time to draw meaningful conclusions -- for example, compare time spent waiting for congestion budget to time spent waiting for pacing or flow control. Even that's subtle because we should usually be blocked on congestion control -- just not so much that we can't achieve the expected throughput.

Regardless, from @stormshield-fabs's data, I think we can reasonably conclude that the congestion controller is doing something wrong, perhaps growing too slowly. A good next step might be trying to reproduce the bad behavior in a narrow unit test, and if that fails, isolating what's different between the unit test scenario and the real-world slow workload.

Ralith avatar Jun 25 '25 02:06 Ralith

MRs around this topic (for reference): #2278, #2289

stormshield-damiend avatar Jul 03 '25 08:07 stormshield-damiend

Some new tests.

First, some context:

  • in #2300 I changed the way data is uploaded from client to server, writing as many chunks of preallocated data as possible using write_chunks().
  • I then used a modified version of #1659 to get a reproducible test environment with stable latency and no packet drops, since the tests we ran over the internet depend on path latency, drops, and load.

The goal was to compare the impact of latency on throughput across various QUIC stacks. I tested quinn based on #2300, secnetperf from msquic (a commit from a few months ago), and the latest master version of s2n-quic.

Here are the results:

|------------+---------------+--------------+--------------+--------------+---------------|
|            | 0ms           | 10ms (-l 5)  | 20ms (-l 10) | 40ms (-l 20) | 100ms (-l 50) |
|------------+---------------+--------------+--------------+--------------+---------------|
| quinn      | 11795.78 Mb/s | 918.29 Mb/s  | 465.44 Mb/s  | 231.67 Mb/s  | 95.13 Mb/s    |
|            | 12276.73 Mb/s | 915.14 Mb/s  | 466.22 Mb/s  | 235.21 Mb/s  | 94.67 Mb/s    |
|            | 12359.78 Mb/s | 914.10 Mb/s  | 464.39 Mb/s  | 234.82 Mb/s  |               |
|------------+---------------+--------------+--------------+--------------+---------------|
| secnetperf | 14835133 kbps | 5248351 kbps | 4299376 kbps | 4742731 kbps | 2811823 kbps  |
|            | 16351173 kbps | 5453510 kbps | 5880245 kbps | 4565408 kbps | 2237253 kbps  |
|------------+---------------+--------------+--------------+--------------+---------------|
| s2n-quic   | 13.01Gbps     | 1.90Gbps     | 1.25Gbps     | 602.96Mbps   | 285.51Mbps    |
|            | 13.05Gbps     | 1.85Gbps     | 1.29Gbps     | 605.37Mbps   | 291.70Mbps    |
|------------+---------------+--------------+--------------+--------------+---------------|

We can see that all stacks suffer from latency and achieve lower throughput as latency grows, but quinn seems to be far more affected than the other stacks.
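One way to read the quinn rows: multiplying each latency by the measured throughput gives a roughly constant product, which is the signature of a sender capped by a fixed window per round trip rather than by path bandwidth. A small self-contained check, with figures copied from the first row of each quinn cell above:

```rust
fn main() {
    // quinn's measured upload throughput (Mb/s) against latency (ms),
    // taken from the table above.
    let samples = [
        (10.0_f64, 918.29_f64),
        (20.0, 465.44),
        (40.0, 231.67),
        (100.0, 95.13),
    ];

    // If the sender is limited to a fixed amount of data per round trip,
    // latency * throughput (the effective window) should be constant.
    let windows: Vec<f64> = samples.iter().map(|(lat, tput)| lat * tput).collect();
    for w in &windows {
        // All products agree within ~4% (roughly 9.2 Gbit*ms, ~1.15 MB),
        // consistent with a window-limited sender (cwnd or flow control).
        assert!((w / windows[0] - 1.0).abs() < 0.05);
    }
}
```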

stormshield-damiend avatar Aug 12 '25 15:08 stormshield-damiend

More tests done today with @stormshield-fabs; here are some results:

First, the simulated network has zero packet loss and a latency of 10 ms. Second, the internet network has some packet loss (not measured) and a latency of 15-20 ms.

  1. We tried disabling the pacer / CC / connection + stream flow control, without success: no packets are emitted (the flow control patch seems to be wrong)
  2. We tried disabling the pacer alone and saw lower performance on both the internet and simulated networks
  3. We tried disabling the pacer / CC (on internet we saw a 20% increase, on the simulated network no gain)
  4. We implemented a constant CC which always replies with a fixed value (on internet we saw a 40% gain, on the simulated network nothing)
  5. We tried stream_recv_window = send_window = UMAX with old Cubic plus spurious loss detection (on internet nothing changed, on simulated we reached 2.5 Gbps at 10 ms latency)
  6. We tried stream_recv_window = send_window = UMAX with constant CC = UMAX (on internet we reached 1 Gbps, on the simulated 10 ms network we reached 10 Gbps)
  7. We tried stream_recv_window = send_window = UMAX, constant CC = UMAX, and the pacer disabled (on internet we also reached 1 Gbps, on the simulated 10 ms network we reached 10 Gbps)
  8. We tried stream_recv_window = send_window = UMAX, old Cubic with initial window = UMAX (on internet we reach 1 Gbps for a few seconds, then the value drops to around 500 Mbps; on the simulated network we reached 10 Gbps)

Interpretation:

  • 4 suggests that on a non-optimal link (internet, with possible loss and packet reordering) we may trigger loss detection too often, lowering the CC window and thus the throughput (just an intuition for now)
  • 5 and 6 seem to be the best in terms of gain, but using the constant CC also changes the pacer, so it is hard to tell which modification is responsible for the gain
  • 6 and 7 suggest that the pacer is not part of the issue
  • 7 and 8: we are not sure what they suggest ;)

stormshield-damiend avatar Aug 13 '25 14:08 stormshield-damiend

Interesting. I concur with your analysis that the pacer isn't a relevant bottleneck. 7 is unremarkable: a configuration which already saturates the path shouldn't change if we remove even more constraints, except possibly to degrade if something gets overwhelmed. For 8, I'd guess that the simulated network is buffering to an unrealistic extent, whereas on the internet severe congestion arises, which triggers aggressive backoff.

Just to double-check, when setting flow control limits to usize::MAX, you also set TransportConfig::receive_window? Only changing stream-level limits while the connection-level limit is at default wouldn't have a huge effect. Either you did get this right or it doesn't matter since you observed good throughput in some configurations anyway.

Overall this supports the hypothesis that our congestion control/loss recovery is responsible for poor throughput. Brainstorming some possibilities:

  • We're still(!) detecting more congestion/loss events than other implementations (testable by comparing traces/statistics between impls)
  • We're responding to a similar number of congestion/loss events more conservatively (testable by driving different implementations' congestion controllers with identical workloads in pure unit tests, or more roughly by eyeballing charted cwnd changes)
  • Loss detection and congestion control are working fine, but we're failing to fill the whole congestion window (testable by comparing congestion control window to actual in-flight data size by analysis of Quinn qlogs)
  • Some difference in configuration between implementations

Ralith avatar Aug 13 '25 22:08 Ralith

For case 4, was the "fixed value" usize::MAX (i.e. CC effectively disabled)? If so, the fact that 4 is slower than 6/7 suggests that flow control is a significant bottleneck. That's not surprising for the simulated network which has a greater bandwidth-delay product than the defaults target (at most roughly 100MiB/s * 100ms, or 1GiB/s * 10ms). That should be true for any implementation with the same configuration, though -- it's explicitly illegal to send more than a full flow-control window per round-trip, since you can't know if the receiver is ready for more until they've responded.
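That per-round-trip ceiling is easy to quantify. A rough sketch, using hypothetical window sizes (not quinn's defaults) over the 10 ms paths discussed above:

```rust
fn main() {
    // Flow control imposes a hard ceiling: at most one window per round trip.
    let ceiling_bps = |window_bytes: f64, rtt_secs: f64| window_bytes * 8.0 / rtt_secs;

    let rtt = 0.010; // 10 ms, as in the test beds above

    // Illustrative window sizes, chosen to show the shape of the limit:
    // a 1.25 MB window caps throughput at ~1 Gbit/s over a 10 ms path,
    // while a 16 MB window allows ~12.8 Gbit/s.
    assert_eq!(ceiling_bps(1_250_000.0, rtt).round() as u64, 1_000_000_000);
    assert_eq!(ceiling_bps(16_000_000.0, rtt).round() as u64, 12_800_000_000);
}
```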

How big a gap is there between "40% improvement" and the expected 1Gbps on the internet?

Ralith avatar Aug 13 '25 22:08 Ralith

A few answers:

  • Internet link capacity is around 2 Gbps; baseline quinn does 450 Mbps with 15-20 ms latency and an unknown amount of drops
  • Simulated link capacity is limited by CPU power; baseline quinn does 950 Mbps with 10 ms latency and no drops
  • TransportConfig::receive_window is infinite by default; we did not change it
  • For case 4, the fixed CC was set at 30 MB

stormshield-damiend avatar Aug 14 '25 07:08 stormshield-damiend

One more day of tests:

  • We found that the default settings for the stream receive window and send window are too low to reach line rate on the internet test bed. We ran tests varying those two settings using CC = UMAX: with the default values we reach the usual 450 Mb/s; with at least 8 to 16 MB for both we reached 1 Gbps. This was also confirmed on the simulated test bed.
  • We noticed that secnetperf has ECN disabled by default; we tried with and without, and performance is the same (~1 Gbps) on the internet test bed.
  • We tried disabling ECN as a source of congestion events using CC = cubic with spurious loss detection on quinn, and it did not have any impact on performance.
  • We made some graphs of cwnd vs bytes in flight for CC = cubic with spurious loss on both test beds; @stormshield-fabs will share some shortly.
  • On the internet test bed, iperf3 reaches a cwnd of 7 MB whereas quinn with CC = cubic + spurious loss stays at 1 MB.

Internet / CC = Cubic + spurious / 16M / 16M -> 1 Gbps

Image

Simulated / CC = cubic + spurious / 16M / 16M -> we don't remember the throughput ;)

Image

With the same settings, we saw an abnormal graph where the cwnd rises towards infinity.

Image

As a conclusion, the issue seems related to the generation of congestion events, or to their handling; ECN is not a suspect.

Bonus remark: we also have a PoC of CUBIC per RFC 9438, which does not lead to any gain, but we understand it cannot work properly due to the previously suspected bugs.

stormshield-damiend avatar Aug 14 '25 15:08 stormshield-damiend

Taking inspiration from https://github.com/mozilla/neqo/issues/1472, we've been plotting sent, acked and lost packets over time (on a simulated network with a 10 ms latency, --upload-size 1000000000 --duration 20).

Image

So far, all lost packets follow the same pattern. We're trying to make sense of that.

stormshield-fabs avatar Aug 21 '25 09:08 stormshield-fabs

Comparing the timings of packet emission, reception, and acknowledgment on both the server and the client showed that the packet losses were real. We are using a network simulated with netem at 10 ms latency.

Image

We also dumped the traffic on both containers using tcpdump and it showed that the presumably lost packets arrive at the interface on the server side (remember we're doing an upload test), but quinn never sees them.

We believe the packets are dropped by the operating system at the socket level, maybe because quinn does not read them fast enough?

To test this hypothesis, we've tried increasing net.core.rmem_max (the default value is 212992). quinn/perf tries to increase the socket buffer to 1 MB at startup, but the kernel denies the request, and I haven't seen anything related in quinn's code. It's worth noting that msquic appears to ask for SO_RCVBUF=INT32_MAX when it instantiates the socket.

With the default value (and using 32 MB for stream_receive_window and send_window), quinn reaches 1.5 Gbps while msquic/secnetperf reaches 5.4 Gbps.

With rmem_max = 2000000, quinn reaches an astounding 5 Gbps (whereas msquic/secnetperf reaches 14.7 Gbps...).

We plan on testing this hypothesis on the "real" internet test bed.

stormshield-fabs avatar Aug 21 '25 14:08 stormshield-fabs

More tests on the internet link:

  • baseline (32 MB for stream_receive_window and send_window, recv buffer size at 425984): 363 Mbps
  • with recv buffer size at 2097152: 1.03 Gbps (the same value as when using the "constant congestion controller")

Here's the cwnd for CUBIC with the large receive buffer, as well as the in-flight bytes: we've never seen them so high. This would mean that CUBIC works mostly correctly (with spurious loss detection enabled).

Image

stormshield-fabs avatar Aug 21 '25 15:08 stormshield-fabs

Summarizing chat discussion:

  • Looks like we're filling the cwnd correctly 🎉
    • With an exception in the simulated network, where the sender is probably CPU-limited and computing an inaccurate cwnd due to #1126
  • Sensitivity to kernel socket buffer size is expected at very high rates. This could be significantly improved by #1576. msquic is known to already have parallel receive, which might account for their higher performance.
  • We might want to explore choosing a better default socket buffer size, or even adapting it dynamically.

You previously excluded the socket buffer difference hypothesis in https://github.com/quinn-rs/quinn/issues/2262#issuecomment-2943235077. Were those numbers incorrect?

I think there are sysfs counters that report the number of packets dropped by the kernel and the kernel buffer fill level. Those might be a more useful source of truth than inferring things from changed performance.

with recv buffer size at 2097152: 1.03 Gbps

How much gap does that leave to msquic/expected real-world performance?

Ralith avatar Aug 21 '25 16:08 Ralith

You previously excluded the socket buffer difference hypothesis

Those numbers were correct: with the default rmem_max=212992, both quinn/perf and msquic/secnetperf use a receive buffer size of 425984 for the socket (this is documented kernel behavior), and msquic performs significantly better. Also, at the time we hadn't modified the stream_receive_window and send_window sizes.
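The 425984 figure follows from documented Linux behavior: the kernel clamps SO_RCVBUF requests to net.core.rmem_max and then doubles the value to leave room for bookkeeping. A quick sanity check with the values from this thread:

```rust
fn main() {
    // Linux clamps setsockopt(SO_RCVBUF) requests to net.core.rmem_max,
    // then doubles the result to account for kernel bookkeeping overhead.
    let rmem_max: u64 = 212_992; // default on the test machines
    let requested: u64 = 1_048_576; // quinn/perf asks for 1 MB at startup

    let effective = requested.min(rmem_max) * 2;
    assert_eq!(effective, 425_984); // matches rb425984 in the ss output quoted earlier
}
```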

How much gap does that leave to msquic/expected real-world performance?

What we found yesterday is that increasing rmem_max helped both implementations, but quinn showed the greatest speedup. Below is a summary table on an internet link (10 ms latency, iperf3 reaches 2 Gbps).

                 │ quinn    │ msquic
rmem_max=212992  │ 335 Mbps │ 927 Mbps
rmem_max=2097152 │ 897 Mbps │ 1146 Mbps

We do not think changing the socket buffer size is a valid long-term solution: we probably want quinn to perform well with the default OS settings. Also, requesting values higher than rmem_max needs the SO_RCVBUFFORCE socket option, which requires CAP_NET_ADMIN.

We'll take a look at sysfs to validate that packet losses occur at the kernel level.

stormshield-fabs avatar Aug 22 '25 07:08 stormshield-fabs

Makes sense, thanks. My guess would be that #1576 is the next low hanging fruit here, then.

Ralith avatar Aug 22 '25 16:08 Ralith

We were able to plot the socket's receive buffer size over time by parsing ss output. (I haven't figured out a better way to retrieve this data yet.)

The figure shows the congestion window size on the sender side and the OS receive buffer size on the receiver side.

Image

It shows that the buffer is only about 10% full most of the time, since the vast majority of samples are within this range. However, it looks like quinn sometimes falls quite far behind, causing usage to spike, sometimes reaching the 400k hard limit.

Reaching the limit correlates with loss detection on the sender, but these events occur at different in-flight sizes. This suggests quinn might not be consistently too slow to empty the buffer as fast as it's being filled, as we initially suspected, but might instead be experiencing seemingly random delays.

Hooking quinn_perf up to tokio-console, we saw that both the endpoint and connection drivers experience very large maximum poll times, reaching up to 7 ms on a few runs. We noticed a similar pattern with scheduled times.

Image

We focused on the EndpointDriver first since it had the largest observed tail latency. We used the minstant crate to measure the minimum and maximum time spent in the functions called from EndpointDriver::poll.

We've found that recvmmsg is sometimes taking up to a few milliseconds to complete.

Map from the number of packets received to recvmmsg timing
self.recvmmsg_min={
 1: 889ns
 2: 2.2µs
 3: 3.376µs
 4: 4.655µs
 5: 6.171µs
 6: 6.976µs
 7: 10.269µs
 8: 10.991µs
 9: 15.095µs
 10: 19.728µs
 11: 22.813µs
 12: 22.78µs
 13: 23.769µs
 14: 29.129µs
 15: 29.372µs
 16: 38.647µs
 17: 38.244µs
 18: 44.275µs
 19: 45.756µs
 20: 39.389µs
 21: 76.454µs
 22: 60.604µs
 23: 145.469µs
 24: 84.839µs
 25: 63.441µs
 26: 115.315µs
 27: 106.807µs
 28: 632.864µs
 29: 254.228µs
 30: 776.429µs
 31: 369.074µs
 32: 90.682µs
 } 
 self.recvmmsg_max={
 1: 450.576µs
 2: 548.917µs
 3: 517.159µs
 4: 627.393µs
 5: 856.479µs
 6: 860.488µs
 7: 1.167725ms
 8: 1.250136ms
 9: 1.064526ms
 10: 898.995µs
 11: 1.382694ms
 12: 1.031123ms
 13: 963.984µs
 14: 1.082042ms
 15: 915.435µs
 16: 1.478495ms
 17: 1.136442ms
 18: 1.601473ms
 19: 1.39671ms
 20: 1.172288ms
 21: 508.315µs
 22: 839.461µs
 23: 726.151µs
 24: 1.290356ms
 25: 1.976418ms
 26: 625.348µs
 27: 1.526342ms
 28: 1.47945ms
 29: 859.607µs
 30: 1.823659ms
 31: 369.074µs
 32: 1.166615ms
}

This behavior does not correlate with the number of GRO packets received. We can see that the minimum values are mostly acceptable, whereas the maximum values reach the millisecond range quite often. We have also tried changing the BATCH_SIZE, without success.

What's surprising is that using strace -T to track recvmmsg timings paints a very different picture: it reports execution times of only a few microseconds. (The y-axis on the graph below is in µs.) I think this might be because strace only accounts for the effective time spent in the syscall, while our instrumentation also measures scheduling delays and context switches.

Image

We looked again at msquic's implementation and found that it polls the socket on a dedicated thread that forwards packets to another thread for processing. strace revealed that it receives packets one by one.

stormshield-fabs avatar Aug 29 '25 08:08 stormshield-fabs

Is substantially all of the EndpointDriver lag/variance due to recvmmsg() calls?

djc avatar Aug 29 '25 10:08 djc

Yes, the maximum execution times we collected for EndpointDriver::poll and recvmmsg were always very close.

stormshield-fabs avatar Aug 29 '25 11:08 stormshield-fabs

We replicated the recvmmsg instrumentation in msquic and saw similar maximum latency of around 1ms.

msquic appears to use a dedicated thread for all socket reads and writes (at least in the epoll implementation). If GRO is available, it only asks recvmmsg for a single packet, using multi-packet reception as a fallback when GRO coalescing is disabled.
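The dedicated-thread split described above can be sketched with stdlib-only Rust (names and the channel choice are illustrative, not msquic's or quinn's actual code): one thread does nothing but drain the socket, so slow packet processing can never delay the next receive call.

```rust
use std::net::UdpSocket;
use std::sync::mpsc;
use std::thread;

fn main() -> std::io::Result<()> {
    let rx_sock = UdpSocket::bind("127.0.0.1:0")?;
    let addr = rx_sock.local_addr()?;
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Dedicated receive thread: its only job is draining the socket and
    // handing datagrams off over a channel.
    let reader = thread::spawn(move || {
        let mut buf = [0u8; 65_535];
        let (len, _) = rx_sock.recv_from(&mut buf).unwrap();
        tx.send(buf[..len].to_vec()).unwrap();
    });

    // Any other thread plays the sender for this demo.
    let tx_sock = UdpSocket::bind("127.0.0.1:0")?;
    tx_sock.send_to(b"ping", addr)?;

    // Packet processing happens here, decoupled from the socket reads.
    let datagram = rx.recv().unwrap();
    assert_eq!(datagram, b"ping");
    reader.join().unwrap();
    Ok(())
}
```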

So far our conclusion is that quinn might benefit from some architectural work to mitigate the impact of recvmmsg latency on the packet-handling path.

stormshield-fabs avatar Aug 29 '25 13:08 stormshield-fabs