Defer queue livelock problem under high packet loss and high bandwidth scenario
Thanks for the PR.
A concern with prioritizing connections during the handshake is whether this could have a negative impact on established connections, in particular in a scenario with a very high rate of new connections, and potentially allow for a DoS.
How much loss/reordering are you seeing and how much load (WorkerQueueDelay) are we talking about?
In our lab test environment (20% packet loss, 20ms latency, and a 1Gbps symmetric uplink/downlink), we developed a custom benchmark tool to stress-test QUIC connections under adverse conditions. This setup simulates real-world scenarios such as satellite links or congested Wi-Fi networks, where packet loss and reordering are common. Our testing revealed a complex failure mode during handshakes that combines packet deferral, work queue scheduling, and interrupt handling issues, leading to connection collapses that appear as "mysterious handshake timeouts."
Also, here are our findings for this issue:
Root Cause Breakdown
- Handshake Confirmation Packet Loss in Coalesced Datagrams: The client sends a handshake confirmation packet (e.g., containing the HANDSHAKE_DONE frame) to the server, often coalesced with other packets into a single UDP datagram for efficiency (per QUIC's packet coalescing in src/core/packet_builder.c). In lossy networks, however, the datagram may arrive corrupted (e.g., due to bit errors or interference), causing the entire frame to be dropped at the UDP layer. Wireshark traces show the datagram arriving intact at the network level, but MsQuic's integrity checks (in QuicConnRecvDecryptAndAuthenticate in src/core/connection.c) fail and discard it. This loss prevents the server from advancing Connection->State.HandshakeConfirmed to TRUE, stalling key upgrades.
- Deferred Packet Queue Overflow Due to Unconfirmed Handshake: With the handshake unconfirmed, incoming packets encrypted with higher key types (e.g., 1-RTT) cannot be decrypted. They enter the deferred packet path in QuicConnGetKeyOrDeferDatagram (src/core/connection.c, lines ~3702–3779), where Packet->KeyType > Connection->Crypto.TlsState.ReadKey triggers queuing. The per-encryption-level deferred queue (Packets[EncryptLevel]->DeferredPackets) fills rapidly under lossy conditions, hitting the QUIC_MAX_PENDING_DATAGRAMS limit (defined in src/core/quicdef.h). This queue is designed to buffer out-of-order packets during key transitions but becomes a bottleneck when confirmation is lost (see the sketch after this list).
- Indiscriminate Dropping of Retransmitted Handshake Packets: The client retransmits the handshake confirmation (driven by loss detection timers in src/core/loss_detection.c), but the deferred queue treats all packets equally as a FIFO buffer. When full, it drops incoming packets indiscriminately, including critical retransmits of HANDSHAKE_DONE. This is logged as "Max deferred packet count reached" in QuicPacketLogDrop, creating a deadlock where the confirmation cannot be processed because the queue is saturated with undecryptable data packets.
- Connection Collapse from Full Queue and Timeout: With the deferred queue full and no decryption possible (keys remain at handshake level), the connection cannot process protected data. This triggers handshake timeouts via QUIC_CONN_TIMER_SHUTDOWN (set in QuicConnTryClose, src/core/connection.c, lines ~1530–1534), leading to abrupt shutdowns. The issue manifests as a "handshake timeout" in logs, but the root cause is queue exhaustion rather than genuine timer expiration, which makes diagnosis difficult without deep tracing.
- Amplification by Worker Rescheduling and Preemption: Worse, if the worker thread is rescheduled (e.g., because other pending work preempts the queue via QuicWorkerQueueConnection), processing delays increase. This amplifies the problem: the deferred queue fills faster than it drains, and interrupt storms from repeated retransmits overload the system. Our traces show WorkerQueueDelayUpdated events correlating with queue overflows, indicating that scheduling contention exacerbates the deadlock.
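To make the failure mechanism easier to follow, below is a minimal, self-contained C sketch of the defer-or-drop decision as we understand it. The types, names, and the hard-coded limit of 15 are simplified stand-ins for illustration, not the actual msquic definitions in QuicConnGetKeyOrDeferDatagram.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in types and limit; not the actual msquic definitions. */
#define MAX_PENDING_DATAGRAMS 15

typedef enum { KEY_INITIAL, KEY_HANDSHAKE, KEY_1_RTT } KEY_TYPE;

typedef struct {
    KEY_TYPE ReadKey;        /* highest key level the endpoint can decrypt */
    unsigned DeferredCount;  /* packets parked until a higher key arrives */
} CONN_STATE;

/* Returns true if the packet can be processed now; otherwise it is either
 * deferred or, once the cap is hit, dropped outright. The important part is
 * the hard cap: any datagram arriving behind a full buffer is dropped, even
 * one that could have advanced the handshake. */
static bool ReceivePacket(CONN_STATE *Conn, KEY_TYPE PacketKey) {
    if (PacketKey <= Conn->ReadKey) {
        return true;                       /* keys available: process now */
    }
    if (Conn->DeferredCount < MAX_PENDING_DATAGRAMS) {
        Conn->DeferredCount++;             /* park until keys are available */
        printf("deferred (%u pending)\n", Conn->DeferredCount);
    } else {
        printf("dropped: max deferred packet count reached\n");
    }
    return false;
}

int main(void) {
    CONN_STATE Conn = { KEY_HANDSHAKE, 0 };
    /* A burst of 1-RTT packets before the handshake is confirmed fills the
     * buffer; everything after the 15th is silently dropped. */
    for (int i = 0; i < 20; i++) {
        ReceivePacket(&Conn, KEY_1_RTT);
    }
    return 0;
}
```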
Broader Implications
This is not merely a handshake timeout issue—it's a compound problem of work queue deadlock and interrupt storm (i.e. livelock). The deferred queue acts as a critical section: once full, it blocks all progress, and scheduling delays prevent draining. In high-concurrency servers (e.g., under load), preemption worsens this, turning transient packet loss into cascading failures. Without mitigation, connections in lossy environments fail at rates proportional to loss percentage, degrading QUIC's reliability claims.
How the Prioritization Fix Helps
The implemented change prioritizes FLUSH_RECV operations during unconfirmed handshakes (in QuicConnQueueRecvPackets, src/core/connection.c, lines ~3256–3277), ensuring faster processing of incoming packets to drain the deferred queue. This reduces scheduling-induced delays but doesn't fully address packet loss—lost confirmations still cause overflows. For complete reliability, consider combining this with loss-adaptive retransmission or queue size tuning, but the fix is a targeted improvement for handshake-critical paths.
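A minimal sketch of the idea behind the fix, not the actual msquic code: route the receive flush through the priority path only while the handshake is unconfirmed. QueueNormal and QueuePriority here are stand-ins for the normal and priority operation queues.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool HandshakeConfirmed;
} CONNECTION;

static void QueueNormal(CONNECTION *Conn)   { (void)Conn; puts("queued at normal priority"); }
static void QueuePriority(CONNECTION *Conn) { (void)Conn; puts("queued at high priority"); }

static void QueueRecvFlush(CONNECTION *Conn) {
    if (!Conn->HandshakeConfirmed) {
        /* Handshake-critical: jump ahead of already-queued work so the
         * deferred packets are processed before the buffer overflows. */
        QueuePriority(Conn);
    } else {
        QueueNormal(Conn);
    }
}

int main(void) {
    CONNECTION Handshaking = { false }, Established = { true };
    QueueRecvFlush(&Handshaking);  /* boosted */
    QueueRecvFlush(&Established);  /* unchanged */
    return 0;
}
```

The key property is that the boost is scoped to the handshake window; once HandshakeConfirmed flips, the connection competes for worker time exactly as before.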
Tradeoffs of the Prioritization Fix
Benefits (Why We Prioritize):
- Minimal Impact on Established Connections: The prioritization is transient (only during handshake confirmation, typically <100ms) and per-connection. Established connections aren't deprioritized indefinitely—only briefly if a new handshake preempts.
- Addresses Root Cause: Without it, scheduling contention (e.g., from worker preemption) turns packet loss into cascading failures, as described in the PR.
Drawbacks and Risks:
- Unfairness to Established Connections: In scenarios with high new-connection rates (e.g., 1000+/sec), prioritizing handshakes could delay processing for long-lived connections, increasing their latency or causing minor queue backlogs. This is a fairness issue: new connections "jump the line."
- DoS Potential: An attacker could flood with incomplete handshakes (e.g., sending Initial packets but dropping responses) to monopolize worker threads, starving established connections. We understand this is a classic "resource exhaustion" attack, amplified by prioritization.
- Resource Overhead: Slight increase in CPU for priority queue management (via QuicConnQueuePriorityOper), but negligible in practice.
Mitigations and Safeguards:
- Conditional Prioritization: Limit prioritization to scenarios with high loss/reordering (e.g., check for WorkerQueueDelay > 10ms or packet-loss indicators), or cap it per worker (e.g., at most 10% of operations prioritized).
- Rate Limiting: Implement per-IP or per-worker handshake rate limits to prevent DoS floods, rejecting excessive incomplete handshakes.
- Monitoring and Fallbacks: Add telemetry for prioritization impact (e.g., track delays for established vs. new connections). If queue delays exceed thresholds, fall back to normal priority.
- Alternative Approaches: Instead of blanket prioritization, consider adaptive queuing (e.g., boost priority only if the deferred queue is more than 50% full; see the sketch after this list) or per-packet prioritization for handshake-critical frames like HANDSHAKE_DONE.
- Testing Bounds: In our lab, prioritization doesn't degrade established-connection throughput at under 2000 new connections/sec. Above that, we see a <5% latency increase, which is acceptable for most use cases but worth benchmarking in your environment.
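As a rough illustration of the conditional/adaptive variants above, here is a hedged sketch of the boost decision. The thresholds, struct, and field names are hypothetical, not existing msquic code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical thresholds and fields for illustration only. */
#define DEFERRED_LIMIT     15      /* current hard cap on deferred packets */
#define DELAY_THRESHOLD_US 10000   /* 10 ms of worker queue delay */

typedef struct {
    bool     HandshakeConfirmed;
    uint32_t DeferredCount;        /* packets currently parked */
    uint64_t WorkerQueueDelayUs;   /* most recently measured queue delay */
} CONN_VIEW;

/* Boost only while handshaking AND under measurable pressure: the deferral
 * buffer is more than half full, or the worker queue delay is already high.
 * Otherwise stay fair to established connections. */
static bool ShouldBoostPriority(const CONN_VIEW *Conn) {
    if (Conn->HandshakeConfirmed) {
        return false;
    }
    return Conn->DeferredCount > DEFERRED_LIMIT / 2 ||
           Conn->WorkerQueueDelayUs > DELAY_THRESHOLD_US;
}

int main(void) {
    CONN_VIEW Quiet  = { false, 2, 1000 };    /* handshaking, no pressure */
    CONN_VIEW Loaded = { false, 9, 12000 };   /* handshaking, under pressure */
    printf("%d %d\n", ShouldBoostPriority(&Quiet), ShouldBoostPriority(&Loaded));
    return 0;
}
```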
Overall, the benefits outweigh the risks in lossy networks, but safeguards are key. What do you think—should we add a config flag to enable/disable this, or explore the adaptive approach?
Originally posted by @stevefan1999-personal in https://github.com/microsoft/msquic/issues/5473#issuecomment-3358827616
We noticed that in RFC 9001:
5.7. Receiving Out-of-Order Protected Packets
Due to reordering and loss, protected packets might be received by an endpoint before the final TLS handshake messages are received. A client will be unable to decrypt 1-RTT packets from the server, whereas a server will be able to decrypt 1-RTT packets from the client. Endpoints in either role MUST NOT decrypt 1-RTT packets from their peer prior to completing the handshake.
Even though 1-RTT keys are available to a server after receiving the first handshake messages from a client, it is missing assurances on the client state:
- The client is not authenticated, unless the server has chosen to use a pre-shared key and validated the client's pre-shared key binder; see Section 4.2.11 of [TLS13].
- The client has not demonstrated liveness, unless the server has validated the client's address with a Retry packet or other means; see Section 8.1 of [QUIC-TRANSPORT].
- Any received 0-RTT data that the server responds to might be due to a replay attack.
Therefore, the server's use of 1-RTT keys before the handshake is complete is limited to sending data. A server MUST NOT process incoming 1-RTT protected packets before the TLS handshake is complete. Because sending acknowledgments indicates that all frames in a packet have been processed, a server cannot send acknowledgments for 1-RTT packets until the TLS handshake is complete. Received packets protected with 1-RTT keys MAY be stored and later decrypted and used once the handshake is complete.
Note: TLS implementations might provide all 1-RTT secrets prior to handshake completion. Even where QUIC implementations have 1-RTT read keys, those keys are not to be used prior to completing the handshake.
The requirement for the server to wait for the client Finished message creates a dependency on that message being delivered. A client can avoid the potential for head-of-line blocking that this implies by sending its 1-RTT packets coalesced with a Handshake packet containing a copy of the CRYPTO frame that carries the Finished message, until one of the Handshake packets is acknowledged. This enables immediate server processing for those packets.
A server could receive packets protected with 0-RTT keys prior to receiving a TLS ClientHello. The server MAY retain these packets for later decryption in anticipation of receiving a ClientHello.
A client generally receives 1-RTT keys at the same time as the handshake completes. Even if it has 1-RTT secrets, a client MUST NOT process incoming 1-RTT protected packets before the TLS handshake is complete.
But what if:
- The packet containing handshake done from server to client is lost/reordered
- At the same time, there are many protected packets sent from server
- The client still has not confirmed the handshake, so all the protected packets are pushed to the deferred packet buffer
- Loss detection is triggered and another handshake done packet is sent (or an ack-eliciting one? From our observation, PTO is triggered after 2 seconds)
- Since the maximum deferred packet count has been reached (16 at the moment), all subsequent packets after QUIC_MAX_PENDING_DATAGRAMS is hit are dropped, including any packets containing handshake information
- The deferred packet buffer remains full regardless, with no progression at all
- The handshake timeout is triggered and the connection is closed
- Loss detection is triggered, but this time it is already too late
We hit this particular problem because we are simulating a high-loss, moderate-to-high-latency, high-bandwidth environment in our lab testing, and the issue came up consistently.
Also, in this particular sense, because the deferred packet buffer is a bounded buffer, it can also be seen as a semaphore, which is why we consider this a livelock as well.
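A toy timeline of the sequence above (an assumed model, not msquic code): once the buffer is full of undecryptable packets, the retransmit that would unblock decryption is itself dropped, so no round makes progress until the handshake timer fires.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_DEFERRED 15

int main(void) {
    int deferred = 0;
    bool handshake_complete = false;

    for (int round = 1; round <= 5 && !handshake_complete; round++) {
        /* Burst of protected (1-RTT) packets arrives first and parks. */
        for (int i = 0; i < 100 && deferred < MAX_DEFERRED; i++) {
            deferred++;
        }
        /* The retransmitted handshake packet arrives behind the burst;
         * the buffer is already full, so it is dropped as well. */
        if (deferred < MAX_DEFERRED) {
            handshake_complete = true;   /* would unblock decryption */
        } else {
            printf("round %d: buffer full (%d), handshake retransmit dropped\n",
                   round, deferred);
        }
        /* Nothing can be decrypted, so nothing drains: zero progress. */
    }
    if (!handshake_complete) {
        puts("handshake timeout: connection closed");
    }
    return 0;
}
```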
Hi @stevefan1999-personal,
I think we are in agreement that if handshake packets are lost or re-ordered and many protected packets are received, the packets might be dropped. And if the handshake done packet is lost and can't be re-sent in time, the connection will fail.
This can always happen as a corner case, and while the implementation makes some effort to mitigate it by deferring some packets, it must be balanced with maintaining existing connections.
I am not sure I understand the discussion around HANDSHAKE_DONE, etc. Packets are deferred as long as the endpoint doesn't have the keys to decrypt those packets. But HANDSHAKE_DONE is not preventing the endpoint from decrypting 1-RTT packets (it is in 1-RTT packets).
In any case, we would welcome a contribution that improves connection establishment when the loss rate / reordering rate is high, but it must not be at the expense of existing connections.
We have this particular setup that can produce a lot of deferred datagrams:
- name: top_loss_with_fixed_rtt
topology_type: linear
nodes: 2
array_description:
- link_loss:
init_value: [1]
step_len: 0
step_num: 1
- link_latency:
init_value: [3]
- link_jitter:
init_value: [15]
- link_bandwidth_forward:
init_value: [1000]
- link_bandwidth_backward:
init_value: [1000]
That means 1% packet loss, 3ms latency, and 15ms of jitter with a 1000Mbps up/downlink. With this setup we can reach up to 381 deferred packets.
https://github.com/microsoft/msquic/blob/0a879cf8038f7d25a61a8bfca9a299fe1edfc9a6/src/core/quicdef.h#L214
We have raised this value to 1024 and the handshake problem is effectively gone. We believe this value, which evaluates to only 15 (10 + 5) after constant expression evaluation, is too conservative.
It is also noted that QUIC_INITIAL_WINDOW_PACKETS is a constant 10, while the actual setting can be changed at runtime via Settings->InitialWindowPackets. It is unclear whether fixing the deferred window size to 15 is intentional. We have set the initial window packets to 400 for out-of-band cwnd optimization.
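For completeness, a minimal sketch (assuming the documented QUIC_SETTINGS fields InitialWindowPackets and IsSet.InitialWindowPackets) of how an application raises the initial window at runtime; 400 is just our lab value, not a recommendation.

```c
#include <string.h>
#include "msquic.h"

/* Sketch only, not a complete application: raising InitialWindowPackets
 * through QUIC_SETTINGS. Tune the value for your own link characteristics. */
static void BuildSettings(QUIC_SETTINGS* Settings) {
    memset(Settings, 0, sizeof(*Settings));
    Settings->InitialWindowPackets = 400;      /* packets, not bytes */
    Settings->IsSet.InitialWindowPackets = 1;  /* mark the field as set */
    /* Pass the populated struct to ConfigurationOpen, or apply it later
     * via SetParam, in the usual way. */
}
```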
Thanks for the new PR.
Before we move along with it, there are a few questions I am trying to answer:
- understanding why the QUIC_MAX_PENDING_DATAGRAMS value was chosen (is it mentioned in the RFC? etc.)
- whether there are consequences to deferring more packets
We could also consider a new setting, allowing the app to set the number of accepted deferred packets independently from the initial congestion window.
When having the issue in your setup, were you already using the InitialWindowPackets setting?
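To make the suggested setting concrete, here is a purely hypothetical sketch of how a decoupled deferred-packet limit could relate to the existing InitialWindowPackets setting. DeferredDatagramLimit and the fallback rule are illustrative only, not existing msquic API.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t InitialWindowPackets;   /* existing runtime setting */
    uint32_t DeferredDatagramLimit;  /* proposed: 0 means "derive it" */
} APP_SETTINGS;

/* If the app does not set an explicit limit, fall back to something
 * proportional to the initial window rather than a fixed 15. */
static uint32_t EffectiveDeferredLimit(const APP_SETTINGS *S) {
    if (S->DeferredDatagramLimit != 0) {
        return S->DeferredDatagramLimit;
    }
    return S->InitialWindowPackets + 5;
}

int main(void) {
    APP_SETTINGS Derived  = { 400, 0 };     /* derived limit: 405 */
    APP_SETTINGS Explicit = { 10, 1024 };   /* explicit limit: 1024 */
    printf("%u %u\n",
           (unsigned)EffectiveDeferredLimit(&Derived),
           (unsigned)EffectiveDeferredLimit(&Explicit));
    return 0;
}
```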