bug(quinn): making a connection that never gets to a full `Connection` causes excessive delays in `Endpoint::wait_idle`
I would appreciate any help.
The bug is illustrated in the endpoint_wait_idle test in draft PR https://github.com/quinn-rs/quinn/pull/2146, but I will also copy the permalink:
https://github.com/quinn-rs/quinn/blob/dd14b01613d7825f9c4ca25c9cd10b076a077aa7/quinn/src/tests.rs#L154-L177
To summarize: if an Endpoint attempts a connection and, for whatever reason, that connection never makes it past the Connecting stage, the Endpoint::wait_idle call (when attempting to gracefully close the endpoint) always takes ~3s. I'm assuming something like this happens: the ConnectionHandler created for the Connecting never gets cleaned up, so wait_idle only returns after a timeout.
I'm only familiar enough with quinn to know something is wrong, but not what to do to fix it.
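In rough code, the scenario looks something like this (a sketch, not the exact test from the PR; the client TLS config setup is omitted, see the linked test for a working version):

```rust
// Sketch of the scenario; the client TLS config is omitted and would need
// to be set before connecting (see the linked test in the PR).
use std::time::Instant;

#[tokio::main]
async fn main() {
    let endpoint = quinn::Endpoint::client("127.0.0.1:0".parse().unwrap()).unwrap();
    // endpoint.set_default_client_config(...) would go here.

    // Start a connection to an address that will never respond, then drop the
    // `Connecting` future before it ever becomes a full `Connection`.
    let connecting = endpoint
        .connect("127.0.0.1:1".parse().unwrap(), "localhost")
        .unwrap();
    drop(connecting);

    endpoint.close(0u32.into(), b"done");
    let start = Instant::now();
    endpoint.wait_idle().await;
    // Observed: this takes ~3s even though no `Connection` was ever established.
    println!("wait_idle took {:?}", start.elapsed());
}
```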
What leads you to believe that this behavior is incorrect? My guess would be that this is exactly the draining period of the attempted connection. Waiting for that to complete is exactly the role of wait_idle.
I think the suggestion/surprise is that Connecting instances might not deserve the full draining period?
I assume what happens is that on dropping the Connecting a CONNECTION_CLOSE frame is sent, even though only the Initial packet was ever sent. The endpoint then has to sit out the draining period of 3*PTO, and since no RTT sample was ever taken, the PTO calculation still uses the initial RTT estimate, which works out to ~3s?
This makes sense from a protocol point of view. From a user point of view I understand that it is a little surprising that the timeout happens at all in this case. So @djc's question is fair: would it be allowed to shorten the draining period? Would it be desirable?
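As a rough sanity check on the ~3s figure (my arithmetic, assuming quinn follows the RFC 9002 defaults): with no RTT sample yet, the initial RTT is 333ms and `rttvar` starts at half of that, so `PTO ≈ smoothed_rtt + 4·rttvar ≈ 333ms + 666ms ≈ 1s`, and a draining period of `3·PTO` comes out at roughly 3s, which matches the delay seen in `wait_idle`.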
> I think the suggestion/surprise is that Connecting instances might not deserve the full draining period?
Yes, thank you, this is what is surprising for me. We have the "knowledge" that we don't have a Connection yet, since we have not reached that point in the life cycle (because we are still at Connecting), so it felt surprising that I would have to wait for that to close.
You don't have to; wait_idle is strictly optional. It does serve a useful purpose: it helps the peer dispose of any state it may have already allocated in a timely manner. The connection not having been fully established doesn't preclude the peer having allocated state.
> You don't have to; `wait_idle` is strictly optional. It does serve a useful purpose: it helps the peer dispose of any state it may have already allocated in a timely manner. The connection not having been fully established doesn't preclude the peer having allocated state.
But if the goal is to help the peer dispose of any state, arguably we might be able to use a shorter timeout? Like, I think you mean we should allow the Connecting to yield a CONNECTION_CLOSE message, but once that is done we probably don't need to wait for anything -- so a 250-500ms timeout might be sufficient for Connecting instances?
Why would we be able to use a shorter timeout? As far as I can tell, a client connection with a handshake in progress isn't in any way special as far as lifetime/connection-level resource concerns go.
On our side we've improved the situation quite a bit by changing the transport config for just those connections where we don't care whether they close properly (because when they don't, it's likely a network path issue). Specifically, we've reduced the initial RTT estimate to 111ms, which turns into a ~1s timeout waiting for the close ACK in the worst case. Underestimating the RTT can cause low initial throughput, but that's fine for these connections, since only very little data is ever sent over them.
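Roughly what that looks like (a sketch, not our exact code; the TLS/client setup is omitted and `with_short_drain` is just an illustrative name):

```rust
use std::{sync::Arc, time::Duration};

// Given an already-built quinn::ClientConfig (TLS setup not shown), lower the
// initial RTT estimate so the post-close draining period (~3 * PTO) shrinks
// from ~3s to roughly 1s. Trade-off: underestimating the RTT can hurt initial
// throughput, which doesn't matter for small, short-lived connections.
fn with_short_drain(mut config: quinn::ClientConfig) -> quinn::ClientConfig {
    let mut transport = quinn::TransportConfig::default();
    transport.initial_rtt(Duration::from_millis(111));
    config.transport_config(Arc::new(transport));
    config
}
```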
Anyhow, sorry for the fuss, perhaps the above is helpful for anyone else coming across this.