proxy: Investigate hyper's `retry_canceled_requests` behavior
Per #9272, it seems that hyper's `hyper::client::Builder::retry_canceled_requests` doesn't work as we might expect. Specifically, if a server applies a header read timeout and a client makes requests at an interval that matches the timeout, the client sees connection errors instead of the connection being re-established. We should try to understand this behavior better and, ideally, fix it upstream in Hyper.
We should set up a server like:
```rust
let server = hyper::server::Server::try_bind(&addr)?
    // Prevent port scanners, etc., from holding connections open.
    .http1_header_read_timeout(timeout);
```
and then use a hyper client to send requests to the server at an interval that matches `timeout`, to see if we can craft a small reproduction of this issue. If we can't reproduce it with a hyper server, we should be able to reproduce it with a Go HTTP server.
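For example, a minimal client-side sketch of the reproduction (this assumes hyper 0.14 with the `client` feature and tokio with its timer and macro features; the address, port, and ten-second interval are illustrative, not taken from the original report):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Must match the server's http1_header_read_timeout to trigger the race.
    let timeout = Duration::from_secs(10);

    let client = hyper::Client::builder()
        // This is hyper's default; stated explicitly since it's the setting under test.
        .retry_canceled_requests(true)
        .build_http::<hyper::Body>();

    // Send one request per `timeout`, reusing the pooled connection each time.
    let mut interval = tokio::time::interval(timeout);
    loop {
        interval.tick().await;
        match client.get(hyper::Uri::from_static("http://127.0.0.1:8080/")).await {
            Ok(rsp) => println!("status: {}", rsp.status()),
            Err(e) => println!("error: {}", e),
        }
    }
}
```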
Hyper is behaving as documented here. The `retry_canceled_requests` setting only retries requests on connections that were closed before Hyper began writing a new request on that connection. Per the docs:
Set whether to retry requests that get disrupted before ever starting to write.
This means a request that is queued, and gets given an idle, reused connection, and then encounters an error immediately as the idle connection was found to be unusable.
When this is set to `false`, the related `ResponseFuture` would instead resolve to an `Error::Cancel`.
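As a small illustration of what the docs describe (a sketch assuming hyper 0.14, not proxy code): with the retry disabled, a request handed a pooled connection that turns out to be dead fails with a canceled error instead of being re-sent on a fresh connection.

```rust
use hyper::{Body, Client};

// Sketch: opt out of the retry (it defaults to true). A queued request that is
// given a dead idle connection then resolves to an error for which
// hyper::Error::is_canceled() returns true, rather than being retried.
fn client_without_retries() -> Client<hyper::client::HttpConnector, Body> {
    Client::builder()
        .retry_canceled_requests(false)
        .build_http()
}
```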
This behavior is not reproducible with a Hyper client and a Hyper server. In fact, it's only reproducible with a Hyper client talking to a Go net/http server. This is because, due to a bug in net/http (golang/go#54784), the Go server counts all of the time that has elapsed since the end of the previous request on a connection against the (so-called) `ReadHeaderTimeout`, even though that time was not spent reading headers. When the `retry_canceled_requests` setting is enabled in Hyper, the Hyper client will retry requests on connections torn down by net/http's erroneous timeout behavior, but only if the connection was closed before Hyper started sending a new request on it.
However, when the Go server's timeout and the interval at which the Hyper client sends requests are exactly in sync (such as with the every-ten-seconds Prometheus scrapes in #9272), there's a race between Hyper starting to send a new request and the Go server closing the connection. If the Go server closes the connection before Hyper has written the request head, then Hyper will retry the request on a new connection. However, if Hyper has written the request head and then sees an EOF from the Go server tearing down the connection, the `retry_canceled_requests` configuration does not apply, as Hyper does not consider that request "canceled". It instead propagates the connection-closed error in that case.
So, how do we mitigate this? I have an upstream PR to Go to fix the incorrect behavior for `ReadHeaderTimeout` (golang/go#54785), which will hopefully be merged. If we want to be more resilient to this kind of issue in general, though, I don't think it would make sense to change Hyper's `retry_canceled_requests` behavior, since it's doing what it says it does in the documentation. Instead, we would need to consider adding more general-purpose retries to the inbound proxy.
Since these retries would not just be the simple case of retrying requests on connections that were torn down while idle in the connection pool, we would want "real" retry behavior, including things like a retry budget, backoffs, etc., the way we implement retries on outbound proxies. We could do that. However, a simpler solution for this particular problem is to change our header read timeout on Go servers to be greater than Prometheus' scrape interval (which #9272 already did) and hope that Go eventually merges my PR to fix the incorrect timeout behavior in net/http.
This behavior is not reproducible with a Hyper client and a Hyper server.
If we ignore the specific case of `http1_header_read_timeout`, any server that closes a connection can cause this behavior. It so happens that Hyper servers do not provide any sort of idle timeout configuration (which seems like a deficiency: there should be a way to prevent clients from holding idle connections open indefinitely). I.e., even with your fix to Go's header timeout, Go's idle timeout can trigger this problem.
I think the takeaway here is that we need some sort of retries implemented on inbound requests that only apply when a socket error is encountered.
Yeah, I think that sounds right.
One problem with this approach is that not all requests are safe to retry: once we've written a request, an OS error doesn't tell us whether the application has already processed that request.
Instead, we might consider changing the connection pooling behavior (and even making it configurable):
- HTTP/2 connections should generally be safe to reuse: we have a much better chance of detecting the health of a connection before sending data on it (i.e. via PING frames and goaway messages). OS errors are always possible, but these are generally errors that an application would have observed otherwise.
- It's important that we use sufficiently large flow control windows on the inbound client so that multiple peers can safely multiplex without backpressure (by default); see the sketch after this list. Perhaps this should be derived from the available process memory (i.e. so that memory limit constraints play into this calculation)?
- For meshed (transported via HTTP/2) and unmeshed inbound HTTP/1 traffic, we need to set a pool idle timeout that's less than the idle timeout of the server. The default 90s timeout is probably far too large for us. Given Kubernetes' default 10s probe interval, we probably want to be less than that by default.
- In the short term, we could make the pooling behavior configurable via annotations on the Server resource.
- Eventually, we may want to support a CRD that attaches to servers, etc., to control HTTP server behavior. Needs more design.
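A minimal sketch of what the flow-control sizing could look like on a hyper 0.14 client builder; the window values are illustrative placeholders rather than anything derived from process memory as suggested above:

```rust
use hyper::{Body, Client};

const MIB: u32 = 1024 * 1024;

// Illustrative only: an HTTP/2 inbound client with enlarged flow-control windows
// so multiple peers can multiplex without immediately hitting backpressure.
// The 1 MiB / 4 MiB values are placeholders, not a recommendation.
fn inbound_h2_client() -> Client<hyper::client::HttpConnector, Body> {
    Client::builder()
        .http2_only(true)
        // Per-stream window: how much data a single stream may have in flight.
        .http2_initial_stream_window_size(MIB)
        // Connection-wide window: the budget shared by all multiplexed streams.
        .http2_initial_connection_window_size(4 * MIB)
        .build_http()
}
```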
Proposal
2.12.2
Change our inbound HTTP/1 client to set a 3s pool timeout. This should avoid the majority of probe issues. We should document this timeout and advise that servers set an idle timeout >3s.
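A sketch of what that default could look like on a hyper 0.14 client builder (the 3s value is from this proposal; the rest is illustrative, not the proxy's actual code):

```rust
use std::time::Duration;
use hyper::{Body, Client};

// Drop pooled inbound HTTP/1 connections after 3s of idleness so the client
// gives up the connection well before a typical server-side idle or
// header-read timeout can fire mid-request.
fn inbound_http1_client() -> Client<hyper::client::HttpConnector, Body> {
    Client::builder()
        .pool_idle_timeout(Duration::from_secs(3))
        .build_http()
}
```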
2.13.0
Make this timeout configurable via the policy controller, controlled by an annotation set on a Server or Pod.
https://github.com/linkerd/linkerd2-proxy/pull/1931