quinn
Performance issue on internet
This is similar to what was discussed in #2153.
Here are some tests we ran on the following setup:
- Two VMs in two different datacenters in the cloud
- Latency between them is around 10ms (ping)
- Network driver is virtio_net with generic-segmentation-offload: on, generic-receive-offload: on and tx-udp-segmentation: off [fixed] (from ethtool)
- We test uploading from one VM to the other
First, here is the result of an iperf3 test between the two VMs (TCP, unencrypted):
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.12 GBytes 958 Mbits/sec 1102 sender
[ 5] 0.00-10.01 sec 1.11 GBytes 955 Mbits/sec receiver
This matches the expected line rate between the two VMs.
We then used secnetperf from the msquic project (UDP with encryption): https://github.com/microsoft/msquic/tree/main/src/perf
Result: Upload 1298923520 bytes @ 913880 kbps (11370.619 ms).
We are close to the line rate.
We then used the quinn/perf client and server (UDP with encryption):
Overall stats:
RPS: 37.25 (373 requests in 10.01s)
Stream metrics:
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 26.71ms │ 57.00µs │ 57.00µs │ 323.99 Mb/s │ 0.00 Mb/s
P0 │ 18.21ms │ 1.00µs │ 1.00µs │ 99.09 Mb/s │ 0.00 Mb/s
P10 │ 22.98ms │ 2.00µs │ 2.00µs │ 287.31 Mb/s │ 0.00 Mb/s
P50 │ 25.30ms │ 47.00µs │ 47.00µs │ 331.87 Mb/s │ 0.00 Mb/s
P90 │ 29.20ms │ 124.00µs │ 124.00µs │ 365.43 Mb/s │ 0.00 Mb/s
P100 │ 84.67ms │ 398.00µs │ 398.00µs │ 460.59 Mb/s │ 0.00 Mb/s
We are around half the line rate.
If we disable crypto using --no-protection:
Overall stats:
RPS: 38.84 (389 requests in 10.02s)
Stream metrics:
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 25.58ms │ 76.00µs │ 76.00µs │ 337.91 Mb/s │ 0.00 Mb/s
P0 │ 15.38ms │ 1.00µs │ 1.00µs │ 100.66 Mb/s │ 0.00 Mb/s
P10 │ 21.22ms │ 5.00µs │ 4.00µs │ 283.64 Mb/s │ 0.00 Mb/s
P50 │ 24.70ms │ 70.00µs │ 70.00µs │ 339.74 Mb/s │ 0.00 Mb/s
P90 │ 29.60ms │ 143.00µs │ 143.00µs │ 395.84 Mb/s │ 0.00 Mb/s
P100 │ 83.33ms │ 578.00µs │ 578.00µs │ 545.26 Mb/s │ 0.00 Mb/s
We do a bit better but are still far from the line rate.
I can dedicate some work time to this issue; if you have any insight, feel free to comment.
Secnetperf socket parameters (from ss -um): skmem:(r0,rb425984,t0,tb212992,f4096,w0,o0,bl0,d0)
Quinn perf socket parameters (from ss -lnpum): skmem:(r0,rb425984,t0,tb425984,f4096,w0,o0,bl0,d0)
So this does not seem related to the socket recv/send buffer size.
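As an aside, the rb/tb fields in those skmem blobs (the socket's receive/send buffer limits, in bytes) can be pulled out with a small throwaway parser when comparing runs; a sketch:

```rust
// Extract a numeric field (e.g. "rb", "tb") from an `ss` skmem blob like
// "skmem:(r0,rb425984,t0,tb212992,f4096,w0,o0,bl0,d0)".
fn skmem_field(line: &str, key: &str) -> Option<u64> {
    let inner = line.split_once("skmem:(")?.1.strip_suffix(')')?;
    // fields are comma-separated; the first one starting with `key`
    // followed by digits wins
    inner
        .split(',')
        .find_map(|f| f.strip_prefix(key)?.parse().ok())
}

fn main() {
    let line = "skmem:(r0,rb425984,t0,tb212992,f4096,w0,o0,bl0,d0)";
    println!("{:?}", skmem_field(line, "rb")); // Some(425984)
}
```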
Thanks for digging into performance like this!
What packet loss are you observing on this link? What congestion controller are you using? Are you CPU bound or I/O bound? Could allocations/buffering be a relevant constraint here?
No problem.
I will measure packet loss or print what quinn / secnetperf sees.
The congestion controller is quinn's default, so Cubic; msquic/secnetperf supports both (Cubic is the default). I will do a measurement with cc=bbr.
CPU usage during the test: quinn: client around 15% / server around 35%; secnetperf: client around 25% / server around 55%.
I can try other allocators (tikv-jemalloc / mimalloc / ...) and look at which allocator msquic/secnetperf uses by default.
Msquic/secnetperf with cc=bbr: Result: Upload 1760428032 bytes @ 1365141 kbps (10316.459 ms). So it is faster than the default, which is cc=cubic.
Connection stats for quinn/perf (upload)
- Client side
Overall stats:
RPS: 34.96 (350 requests in 10.01s)
Stream metrics:
│ Upload Duration │ Download Duration │ FBL │ Upload Throughput │ Download Throughput
──────┼─────────────────┼───────────────────┼────────────┼───────────────────┼────────────────────
AVG │ 28.44ms │ 69.00µs │ 69.00µs │ 309.47 Mb/s │ 0.00 Mb/s
P0 │ 17.28ms │ 1.00µs │ 1.00µs │ 100.01 Mb/s │ 0.00 Mb/s
P10 │ 21.62ms │ 2.00µs │ 2.00µs │ 231.21 Mb/s │ 0.00 Mb/s
P50 │ 26.90ms │ 69.00µs │ 69.00µs │ 311.95 Mb/s │ 0.00 Mb/s
P90 │ 36.00ms │ 140.00µs │ 140.00µs │ 388.24 Mb/s │ 0.00 Mb/s
P100 │ 83.90ms │ 320.00µs │ 320.00µs │ 485.49 Mb/s │ 0.00 Mb/s
ConnectionStats {
udp_tx: UdpStats { datagrams: 260483, bytes: 376892168, ios: 28741 },
udp_rx: UdpStats { datagrams: 9150, bytes: 376861, ios: 9151 },
frame_tx: FrameStats { ACK: 1055, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 259755 },
frame_rx: FrameStats { ACK: 7949, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 1680, MAX_STREAMS_BIDI: 26, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 350 },
path: PathStats { rtt: 10.027276ms, cwnd: 687939, congestion_events: 4, lost_packets: 24, lost_bytes: 34848, sent_packets: 260484, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 }
}
- Server Side
ConnectionStats {
udp_tx: UdpStats { datagrams: 9303, bytes: 383096, ios: 9303 },
udp_rx: UdpStats { datagrams: 265194, bytes: 383699307, ios: 265195 },
frame_tx: FrameStats { ACK: 8079, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 1710, MAX_STREAMS_BIDI: 27, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 357 },
frame_rx: FrameStats { ACK: 1075, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 1, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 264447 },
path: PathStats { rtt: 9.62579ms, cwnd: 12000, congestion_events: 0, lost_packets: 0, lost_bytes: 0, sent_packets: 9303, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 }
}
From client to server: 4 congestion events, 24 lost packets, 34848 bytes lost.
It looks like you're comparing connection-level stats in iperf3/secnetperf to stream-level stats in Quinn. You also don't seem to be using the default perf configuration. What is the exact command line you're passing to the perf client? If you're using any stream concurrency, then the per-stream throughput will be much lower than the connection-level throughput, because multiple streams are running concurrently. If you aren't using any stream concurrency, then the pipeline is stalling as each "request" waits for a response from the server before the next is issued (see perf client behavior). Judging by sent_packets * current_mtu, the latter seems most likely. At 350 requests * 10ms round trip (minimum), that's accounting for more than a third of your total runtime.
If you want to run a simple "blast bytes down the stream as fast as possible" benchmark, you should set a very large upload and/or download size so you don't spend significant time waiting for responses, or confuse matters with concurrent streams.
Hi, when setting a value for upload or download in the perf client that is too high to be reached within the duration, no stats are displayed. We made a quick patch using a CancellationToken to have all futures stop correctly, but it is not finished for now.
Anyway, here is the upload performance for a 10 s upload using a single stream.
Here are the arguments given to perf_client: --duration 10 --download-size 0 --upload-size 1048576000 --conn-stats
Client
ConnectionStats {
udp_tx: UdpStats { datagrams: 292333, bytes: 424455184, ios: 35585 },
udp_rx: UdpStats { datagrams: 15306, bytes: 632183, ios: 15307 },
frame_tx: FrameStats { ACK: 1469, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 292327 },
frame_rx: FrameStats { ACK: 14252, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 2267, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 0 },
path: PathStats { rtt: 9.547958ms, cwnd: 355389, congestion_events: 11, lost_packets: 490, lost_bytes: 711480, sent_packets: 292334, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 }
}
2025-06-10T15:28:37.184412Z INFO perf_client: shutting down (timeout)
2025-06-10T15:28:37.185629Z WARN perf_client: request: leaving
2025-06-10T15:28:37.217222Z INFO perf_client: stream client bidirectional stream 0 uploaded 421527552 (bytes) elapsed 10.233532888 (secs) throughput 329.53 Mb/s
Here are the corresponding stats for the server side:
ConnectionStats {
udp_tx: UdpStats { datagrams: 15708, bytes: 648697, ios: 15708 },
udp_rx: UdpStats { datagrams: 297263, bytes: 431609556, ios: 297264 },
frame_tx: FrameStats { ACK: 14637, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 3, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 1, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 2312, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 4, NEW_TOKEN: 2, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 1, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 1 },
frame_rx: FrameStats { ACK: 1505, ACK_FREQUENCY: 0, CONNECTION_CLOSE: 0, CRYPTO: 2, DATA_BLOCKED: 0, DATAGRAM: 0, HANDSHAKE_DONE: 0, IMMEDIATE_ACK: 4, MAX_DATA: 0, MAX_STREAM_DATA: 0, MAX_STREAMS_BIDI: 0, MAX_STREAMS_UNI: 0, NEW_CONNECTION_ID: 5, NEW_TOKEN: 0, PATH_CHALLENGE: 0, PATH_RESPONSE: 0, PING: 4, RESET_STREAM: 0, RETIRE_CONNECTION_ID: 0, STREAM_DATA_BLOCKED: 0, STREAMS_BLOCKED_BIDI: 0, STREAMS_BLOCKED_UNI: 0, STOP_SENDING: 0, STREAM: 297256 },
path: PathStats { rtt: 9.824697ms, cwnd: 12000, congestion_events: 0, lost_packets: 0, lost_bytes: 0, sent_packets: 15709, sent_plpmtud_probes: 4, lost_plpmtud_probes: 0, black_holes_detected: 0, current_mtu: 1452 } }
The next move is to look at the TransportConfig and compare it to secnetperf's (I already spotted that they use a far more aggressive ACK delay value than quinn's default).
Thanks for the update! I'm surprised that getting rid of the pipeline bubbles had so little effect. Maybe the original per-stream statistics were already excluding that well enough.
Another area to compare between implementations is congestion controller configuration. A larger initial congestion window, for example, can dramatically improve initial performance on high-bandwidth links.
Given the low quinn CPU usage you report, especially compared to secnetperf's, it seems clear that the bottleneck is quinn choosing not to send more data, which suggests an issue in pacing, congestion control, or flow control. In particular, our Cubic implementation is fairly naive (e.g. we haven't implemented HyStart++), so there may be low hanging fruit there. Computing the proportion of time spent working vs. waiting for pacing vs. waiting for congestion control budget vs. waiting for flow control budget might also help.
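The working-vs-waiting accounting suggested above could be sketched like this (a simplified, self-contained illustration with hypothetical state names, not quinn's actual internals):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// What the sender is currently blocked on (hypothetical categories).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum BlockedOn {
    Working,
    Pacing,
    CongestionControl,
    FlowControl,
}

struct StateClock {
    current: BlockedOn,
    since: Instant,
    totals: HashMap<BlockedOn, Duration>,
}

impl StateClock {
    fn new(initial: BlockedOn, now: Instant) -> Self {
        StateClock { current: initial, since: now, totals: HashMap::new() }
    }

    // Charge the elapsed time to the state we were in, then switch state.
    fn transition_at(&mut self, next: BlockedOn, now: Instant) {
        *self.totals.entry(self.current).or_default() += now - self.since;
        self.current = next;
        self.since = now;
    }
}

fn main() {
    let start = Instant::now();
    let mut clock = StateClock::new(BlockedOn::Working, start);
    clock.transition_at(BlockedOn::CongestionControl, start + Duration::from_millis(3));
    clock.transition_at(BlockedOn::Working, start + Duration::from_millis(10));
    // 3 ms spent working, 7 ms blocked on congestion control
    println!("{:?}", clock.totals[&BlockedOn::CongestionControl]); // 7ms
}
```

Comparing the accumulated totals per category over a run would show whether congestion control, pacing, or flow control dominates the idle time.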
If you want to take a look at the msquic settings, they are all documented here: https://github.com/microsoft/msquic/blob/main/docs/api/QUIC_SETTINGS.md
As discussed on Discord, here is a plot of the evolution of the congestion window during a 600 s upload test (I just used --conn-stats to extract and plot the values).
This does not look good.
Throughput was 349.77 Mb/s.
By the way, this confirms that quinn perf is blocked by the congestion controller:
2025-06-24T07:22:51.816788Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.817842Z TRACE drive{id=0}:send{space=Data pn=15567}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.817857Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.818487Z TRACE drive{id=0}:send{space=Data pn=15574}: quinn_proto::connection: blocked by congestion control
2025-06-24T07:22:51.818516Z TRACE drive{id=0}: quinn_proto::connection: blocked by congestion control
Well, that confirms it was blocked at two specific points in time. We really need to chart/aggregate state over time to draw meaningful conclusions -- for example, compare time spent waiting for congestion budget to time spent waiting for pacing or flow control. Even that's subtle because we should usually be blocked on congestion control -- just not so much that we can't achieve the expected throughput.
Regardless, from @stormshield-fabs's data, I think we can reasonably conclude that the congestion controller is doing something wrong, perhaps growing too slowly. A good next step might be trying to reproduce the bad behavior in a narrow unit test, and if that fails, isolating what's different between the unit test scenario and the real-world slow workload.
MRs around this topic (for reference): #2278, #2289
Some new tests.
First, some context:
- in #2300 I changed the way data is uploaded from client to server, writing as many chunks of preallocated data as possible using write_chunks()
- I then used a modified version of #1659 to have a reproducible test environment with stable latency and no packet drops, since the tests we ran on the internet depend on path latency, drops, and load
The goal was to compare the impact of latency on throughput across various QUIC stacks. I tested quinn based on #2300, secnetperf from msquic (a commit from a few months ago), and the latest master version of s2n-quic.
Here are the results :
|------------+---------------+--------------+--------------+--------------+---------------|
| | 0ms | 10ms (-l 5) | 20ms (-l 10) | 40ms (-l 20) | 100ms (-l 50) |
|------------+---------------+--------------+--------------+--------------+---------------|
| quinn | 11795.78 Mb/s | 918.29 Mb/s | 465.44 Mb/s | 231.67 Mb/s | 95.13 Mb/s |
| | 12276.73 Mb/s | 915.14 Mb/s | 466.22 Mb/s | 235.21 Mb/s | 94.67 Mb/s |
| | 12359.78 Mb/s | 914.10 Mb/s | 464.39 Mb/s | 234.82 Mb/s | |
|------------+---------------+--------------+--------------+--------------+---------------|
| secnetperf | 14835133 kbps | 5248351 kbps | 4299376 kbps | 4742731 kbps | 2811823 kbps |
| | 16351173 kbps | 5453510 kbps | 5880245 kbps | 4565408 kbps | 2237253 kbps |
|------------+---------------+--------------+--------------+--------------+---------------|
| s2n-quic | 13.01Gbps | 1.90Gbps | 1.25Gbps | 602.96Mbps | 285.51Mbps |
| | 13.05Gbps | 1.85Gbps | 1.29Gbps | 605.37Mbps | 291.70Mbps |
|------------+---------------+--------------+--------------+--------------+---------------|
We can see that all stacks suffer from latency, achieving lower throughput as latency grows, but quinn seems to be far more impacted than the other stacks.
More tests done today with @stormshield-fabs; here are some results.
We used two test beds: the simulated network has zero packet loss and a latency of 10 ms; the internet link has some packet loss (not measured) and a latency of 15-20 ms.
1. We tried disabling pacer / CC / flow control (connection + stream) without success: no packets are emitted (our flow control patch seems to be wrong)
2. We tried disabling the pacer alone and saw lower performance on both internet and simulated networks
3. We tried disabling pacer / CC (on internet we saw a 20% increase, on the simulated network no gain)
4. We implemented a constant CC which always replies a fixed value (on internet a 40% gain, on the simulated network nothing)
5. We tried stream_receive_window = send_window = UMAX with the old Cubic plus spurious loss detection (on internet nothing changed, on the simulated network we reached 2.5 Gbps at 10 ms latency)
6. We tried stream_receive_window = send_window = UMAX with a constant CC = UMAX (on internet we reached 1 Gbps, on the simulated 10 ms network 10 Gbps)
7. We tried stream_receive_window = send_window = UMAX, constant CC = UMAX, and the pacer disabled (on internet we also reached 1 Gbps, on the simulated 10 ms network 10 Gbps)
8. We tried stream_receive_window = send_window = UMAX with the old Cubic and initial window = UMAX (on internet we reach 1 Gbps for a few seconds, then the value drops to around 500 Mbps; on the simulated network we reached 10 Gbps)
Interpretation:
- 4 suggests that on a non-optimal link (internet, with possible loss and packet reordering) we may trigger loss detection too often, lowering the CC window and thus the throughput (just an intuition for now)
- 5 and 6 seem to be the best in terms of gain, but using the constant CC also changes the pacer, so it is hard to tell which modification is responsible for the gain
- 6 and 7 suggest that the pacer is not part of the issue
- 7 and 8: we are not sure what they suggest ;)
Interesting. I concur with your analysis that the pacer isn't a relevant bottleneck. 7 is unremarkable: a configuration which already saturates the path shouldn't change if we remove even more constraints, except possibly to degrade if something gets overwhelmed. For 8, I'd guess that the simulated network is buffering to an unrealistic extent, whereas on the internet severe congestion arises, which triggers aggressive backoff.
Just to double-check, when setting flow control limits to usize::MAX, you also set TransportConfig::receive_window? Only changing stream-level limits while the connection-level limit is at default wouldn't have a huge effect. Either you did get this right or it doesn't matter since you observed good throughput in some configurations anyway.
Overall this supports the hypothesis that our congestion control/loss recovery is responsible for poor throughput. Brainstorming some possibilities:
- We're still(!) detecting more congestion/loss events than other implementations (testable by comparing traces/statistics between impls)
- We're responding to a similar number of congestion/loss events more conservatively (testable by driving different implementations' congestion controllers with identical workloads in pure unit tests, or more roughly by eyeballing charted cwnd changes)
- Loss detection and congestion control are working fine, but we're failing to fill the whole congestion window (testable by comparing congestion control window to actual in-flight data size by analysis of Quinn qlogs)
- Some difference in configuration between implementations
For case 4, was the "fixed value" usize::MAX (i.e. CC effectively disabled)? If so, the fact that 4 is slower than 6/7 suggests that flow control is a significant bottleneck. That's not surprising for the simulated network which has a greater bandwidth-delay product than the defaults target (at most roughly 100MiB/s * 100ms, or 1GiB/s * 10ms). That should be true for any implementation with the same configuration, though -- it's explicitly illegal to send more than a full flow-control window per round-trip, since you can't know if the receiver is ready for more until they've responded.
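To make the bandwidth-delay arithmetic above concrete: since a sender may emit at most one flow-control window per round trip, the window must cover the bandwidth-delay product to sustain a given rate. A stdlib-only sketch (the example figures are illustrative):

```rust
// Minimum flow-control window (bytes) needed to sustain a target rate,
// given that at most one full window can be in flight per round trip.
fn required_window_bytes(target_bits_per_sec: f64, rtt_secs: f64) -> u64 {
    (target_bits_per_sec / 8.0 * rtt_secs).ceil() as u64
}

fn main() {
    // 2 Gbps over a 10 ms RTT needs about 2.5 MB of window...
    println!("{}", required_window_bytes(2e9, 0.010)); // 2500000
    // ...while a 10 Gbps link at the same RTT needs about 12.5 MB.
    println!("{}", required_window_bytes(10e9, 0.010)); // 12500000
}
```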
How big a gap is there between "40% improvement" and the expected 1Gbps on the internet?
A few answers:
- Internet link capacity is around 2 Gbps; baseline quinn does 450 Mbps with 15-20 ms latency and an unknown amount of drops
- Simulated link capacity is limited by CPU power; baseline quinn does 950 Mbps with 10 ms latency and no drops
- TransportConfig::receive_window is infinite by default; we did not change it
- For case 4, the fixed CC window was set at 30 MB
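For illustration, the "constant CC" used in case 4 amounts to something like the following (a self-contained sketch; the trait here is a simplified stand-in for quinn_proto's congestion controller interface, not its real signatures):

```rust
// Simplified stand-in for a congestion controller interface.
trait CongestionController {
    fn window(&self) -> u64;
    fn on_congestion_event(&mut self);
}

// A controller that always reports the same window, removing congestion
// control from the equation to isolate other bottlenecks.
struct ConstantCc {
    window: u64,
}

impl CongestionController for ConstantCc {
    fn window(&self) -> u64 {
        self.window // never shrinks, never grows
    }
    fn on_congestion_event(&mut self) {
        // deliberately ignore loss events
    }
}

fn main() {
    let mut cc = ConstantCc { window: 30 * 1024 * 1024 }; // fixed at 30 MB
    cc.on_congestion_event();
    println!("{}", cc.window()); // 31457280
}
```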
One more day of tests:
- We found that the default settings for stream_receive_window and send_window are too low to reach line rate on the internet test bed. We ran some tests varying those two settings using CC = UMAX: with the default values we reach the usual 450 Mb/s; with at least 8 to 16 MB for both, we reached 1 Gbps. This was also confirmed on the simulated test bed.
- We noticed that secnetperf has ECN disabled by default; we tried with and without, and performance is the same (~1 Gbps) on the internet test bed.
- We tried disabling ECN-triggered congestion events using CC = cubic with spurious loss detection on quinn, and it did not have any impact on performance.
- We made some graphs of cwnd vs bytes in flight for CC = cubic with spurious loss detection on both test beds; @stormshield-fabs will share some right now.
- On the internet test bed, iperf3 reaches a cwnd of 7 MB whereas quinn with CC = cubic + spurious loss stays at 1 MB.
Internet / CC = Cubic + spurious / 16M / 16M -> 1 Gbps
Simulated / CC = Cubic + spurious / 16M / 16M -> we don't remember the throughput ;)
With the same settings we saw an abnormal graph where the cwnd rises to close to infinity.
As a conclusion, the suspect seems to be the generation of congestion events, or their handling; ECN is not a suspect.
Bonus remark: we also have a PoC of CUBIC RFC 9438, which does not bring any gain, but we understand it cannot work correctly until the previously suspected bugs are fixed.
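For reference, raising the two limits that mattered looks roughly like this with quinn's TransportConfig (a sketch; method names as exposed by recent quinn releases, sized at the 16 MB that reached line rate above):

```rust
use quinn::{TransportConfig, VarInt};

fn large_window_transport() -> TransportConfig {
    let mut transport = TransportConfig::default();
    // 16 MB per-stream receive window and connection send buffer, sized
    // to cover the bandwidth-delay product of the tested links
    transport.stream_receive_window(VarInt::from_u32(16 * 1024 * 1024));
    transport.send_window(16 * 1024 * 1024);
    transport
}
```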
Taking inspiration from https://github.com/mozilla/neqo/issues/1472, we've been plotting sent, acked and lost packets over time (on a simulated network with a 10 ms latency, --upload-size 1000000000 --duration 20).
So far, all lost packets follow the same pattern. We're trying to make sense of that.
Comparing the timings of packet emission, reception, and acknowledgment on both the server and the client showed that the packet losses were real. We are using a simulated network (netem) with a 10 ms latency.
We also dumped the traffic on both containers using tcpdump and it showed that the presumably lost packets arrive at the interface on the server side (remember we're doing an upload test), but quinn never sees them.
We believe the packets are dropped by the operating system at the socket level, maybe because quinn does not read them fast enough?
To test this hypothesis, we've tried increasing net.core.rmem_max (the default value is 212992). quinn/perf tries to increase this value to 1 MB at startup but the kernel denies the request; I haven't seen anything related in quinn's code. It's worth noting that msquic appears to ask for SO_RCVBUF=INT32_MAX when it instantiates the socket.
With the default value (and using 32 MB for stream_receive_window and send_window), quinn reaches 1.5 Gbps while msquic/secnetperf reaches 5.4 Gbps.
With rmem_max = 2000000, quinn reaches an astounding 5 Gbps (whereas msquic/secnetperf reaches 14.7 Gbps...).
We plan on testing this hypothesis on the "real" internet test bed.
More tests on the internet link:
- baseline (32 MB for stream_receive_window and send_window, recv buffer size at 425984): 363 Mbps
- with recv buffer size at 2097152: 1.03 Gbps (which is the same value as when using the "constant congestion controller")
Here's the cwnd for CUBIC with the large receive buffer, as well as the in-flight bytes: we've never seen them so high. This would mean that CUBIC works mostly correctly (with spurious loss detection enabled).
Summarizing chat discussion:
- Looks like we're filling the cwnd correctly 🎉
- With an exception in the simulated network, where the sender is probably CPU-limited and computing an inaccurate cwnd due to #1126
- Sensitivity to kernel socket buffer size is expected at very high rates. This could be significantly improved by #1576. msquic is known to already have parallel receive, which might account for their higher performance.
- We might want to explore choosing a better default socket buffer size, or even adapting it dynamically.
You previously excluded the socket buffer difference hypothesis in https://github.com/quinn-rs/quinn/issues/2262#issuecomment-2943235077. Were those numbers incorrect?
I think there are sysfs counters that report the number of packets dropped by the kernel and the kernel buffer fill level. Those might be a more useful source of truth than inferring things from changed performance.
with recv buffer size at 2097152 : 1.03 Gbps
How much gap does that leave to msquic/expected real-world performance?
You previously excluded the socket buffer difference hypothesis
Those numbers were correct : with the default rmem_max=212992, both quinn/perf and msquic/secnetperf use a receive buffer size of 425984 for the socket (this is a documented kernel behavior), and msquic performs significantly better. Also at the time we hadn't modified the stream_receive_window and send_window sizes.
How much gap does that leave to msquic/expected real-world performance?
What we found yesterday is that increasing rmem_max helped both implementations, but quinn showed the greatest speedup. Below is a summary table on an internet link (10 ms latency, iperf3 reaches 2 Gbps).
| | quinn | msquic |
|---|---|---|
| rmem_max=212992 | 335 Mbps | 927 Mbps |
| rmem_max=2097152 | 897 Mbps | 1146 Mbps |
We do not think changing the socket buffer size is a valid long-term solution: we probably want quinn to perform well with the default OS settings. Also, asking for values higher than rmem_max requires the SO_RCVBUFFORCE socket option, which requires CAP_NET_ADMIN.
We'll take a look at sysfs to validate that packet losses occur at the kernel level.
Makes sense, thanks. My guess would be that #1576 is the next low hanging fruit here, then.
We were able to plot the socket's receive buffer size over time by parsing ss output. (I haven't figured out a better way to retrieve this data yet.)
The figure shows the congestion window size on the sender side and the OS receive buffer size on the receiver side.
It shows that the buffer is only 10% full most of the time, since the vast majority of samples are within this range. However, it looks like quinn sometimes falls quite far behind and usage spikes up, sometimes reaching the 400k hard limit.
Reaching the limit correlates with loss detection on the sender, but these events occur at different in-flight sizes, suggesting quinn might not be consistently too slow to drain the buffer as fast as it fills, as we initially suggested, but might instead be experiencing seemingly random delays.
Hooking up quinn_perf to tokio-console, we saw that both the endpoint and connection drivers experience very large maximum poll times reaching up to 7 ms on a few runs. We noticed a similar pattern with scheduled times.
We focused on the EndpointDriver first since it had the largest observed tail latency. We used the minstant crate to measure the minimum and maximum time spent in the functions called from EndpointDriver::poll.
We've found that recvmmsg is sometimes taking up to a few milliseconds to complete.
Map from the number of packets received to recvmmsg timing
self.recvmmsg_min={
1: 889ns
2: 2.2µs
3: 3.376µs
4: 4.655µs
5: 6.171µs
6: 6.976µs
7: 10.269µs
8: 10.991µs
9: 15.095µs
10: 19.728µs
11: 22.813µs
12: 22.78µs
13: 23.769µs
14: 29.129µs
15: 29.372µs
16: 38.647µs
17: 38.244µs
18: 44.275µs
19: 45.756µs
20: 39.389µs
21: 76.454µs
22: 60.604µs
23: 145.469µs
24: 84.839µs
25: 63.441µs
26: 115.315µs
27: 106.807µs
28: 632.864µs
29: 254.228µs
30: 776.429µs
31: 369.074µs
32: 90.682µs
}
self.recvmmsg_max={
1: 450.576µs
2: 548.917µs
3: 517.159µs
4: 627.393µs
5: 856.479µs
6: 860.488µs
7: 1.167725ms
8: 1.250136ms
9: 1.064526ms
10: 898.995µs
11: 1.382694ms
12: 1.031123ms
13: 963.984µs
14: 1.082042ms
15: 915.435µs
16: 1.478495ms
17: 1.136442ms
18: 1.601473ms
19: 1.39671ms
20: 1.172288ms
21: 508.315µs
22: 839.461µs
23: 726.151µs
24: 1.290356ms
25: 1.976418ms
26: 625.348µs
27: 1.526342ms
28: 1.47945ms
29: 859.607µs
30: 1.823659ms
31: 369.074µs
32: 1.166615ms
}
This behavior does not correlate with the number of GRO packets received. We can see that the minimum values are mostly acceptable, whereas the maximum values reach the millisecond range quite often. We have also tried changing BATCH_SIZE, without success.
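The min/max bookkeeping behind the maps above can be reproduced with stdlib types (a sketch; the sample durations in main are taken from the data above):

```rust
use std::collections::HashMap;
use std::time::Duration;

// Track min/max elapsed time per call, keyed by the number of datagrams
// a single recvmmsg-like call returned.
#[derive(Default)]
struct SyscallTimings {
    min: HashMap<usize, Duration>,
    max: HashMap<usize, Duration>,
}

impl SyscallTimings {
    fn record(&mut self, packets: usize, elapsed: Duration) {
        self.min
            .entry(packets)
            .and_modify(|d| *d = (*d).min(elapsed))
            .or_insert(elapsed);
        self.max
            .entry(packets)
            .and_modify(|d| *d = (*d).max(elapsed))
            .or_insert(elapsed);
    }
}

fn main() {
    let mut t = SyscallTimings::default();
    // in the real instrumentation, `elapsed` would come from an Instant
    // taken around the recvmmsg call
    t.record(1, Duration::from_nanos(889));
    t.record(1, Duration::from_micros(450));
    println!("{:?} {:?}", t.min[&1], t.max[&1]);
}
```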
What's surprising is that using strace -T to track recvmmsg timings paints a very different picture, since it reports execution times of around a few microseconds. (The y-axis on the graph below is µs.) I think this might be because strace only accounts for the effective time spent in the syscall, while our instrumentation also measures scheduling delays and context switches.
We looked again at msquic's implementation and found that they poll the socket in a dedicated thread that forwards packets to another one for processing. strace revealed that it receives packets one-by-one.
Is substantially all of the EndpointDriver lag/variance due to recvmmsg() calls?
Yes, the maximum execution times we collected for EndpointDriver::poll and recvmmsg were always very close.
We replicated the recvmmsg instrumentation in msquic and saw similar maximum latency of around 1ms.
msquic appears to be using a dedicated thread for all socket reads and writes (at least in the epoll implementation). If GRO is available, it only asks recvmmsg to receive a single packet, using multi-packet reception as a fallback when GRO coalescing is disabled.
So far our conclusion is that quinn might benefit from some architecture work to mitigate the impact of recvmmsg latency on the packet handling path.
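The dedicated-reader architecture described here can be sketched with a channel (a toy illustration with a synthetic datagram source, not msquic's or quinn's actual code):

```rust
use std::sync::mpsc;
use std::thread;

// A dedicated thread blocks on the socket and forwards datagrams over a
// channel, so slow packet processing never delays the next receive call.
// The socket read is stood in for by a synthetic datagram source.
fn run_reader(tx: mpsc::Sender<Vec<u8>>) {
    // in the real design this loop would block on recvmmsg()/recv_from()
    for n in 0..3u8 {
        if tx.send(vec![n; 4]).is_err() {
            break; // receiver side went away
        }
    }
    // dropping `tx` on return closes the channel
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let reader = thread::spawn(move || run_reader(tx));

    // processing happens here, decoupled from the "socket" thread
    let total: usize = rx.iter().map(|datagram| datagram.len()).sum();
    reader.join().unwrap();
    println!("{total}"); // 12
}
```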