
No configurable option for HTTP/1 idle timeout on keep-alive connections

Open · gamerslouis opened this issue 8 months ago

What is the issue?

HTTP/1.1 keep-alive connections are being prematurely closed by the Linkerd proxy after 30 seconds of idleness, and there is currently no way to configure this timeout.

We are using [VictoriaMetrics](https://victoriametrics.com/) to collect metrics from services running inside a Kubernetes cluster. Many of our scrape intervals are configured to 60 seconds. We’ve observed that the vmagent scraper is encountering frequent connection closures from the Linkerd sidecar:

INFO ThreadId(01) linkerd_app_core::serve: Connection closed error=read header from client timeout client.addr=10.244.0.13:37704 server.addr=xxx.xxx.xxx.xxx:xxx

This indicates that persistent HTTP/1 connections used by vmagent to scrape metrics are being closed before the next request is sent. Capturing traffic with tcpdump shows that vmagent uses HTTP keep-alive and only sends requests every 60 seconds, which causes the HTTP/1 connection to remain idle for longer than Linkerd's 30-second default.

When Linkerd is enabled, these connections are closed after 30 seconds of inactivity. From debug logs, we see the following error:

[  1861.475759s] DEBUG ThreadId(01) linkerd_proxy_http::server: The client is shutting down the connection res=Err(hyper::Error(HeaderTimeout))

We traced this to the underlying [Hyper](https://github.com/hyperium/hyper/blob/c88df7886c74a1ade69c0b4c68eaf570c8111622/src/server/conn/http1.rs#L79) implementation used by Linkerd. Hyper's Builder sets a default h1_header_read_timeout of 30 seconds. This timeout is triggered when no new headers are received during that period on an idle HTTP/1 connection.

h1_header_read_timeout: Dur::Default(Some(Duration::from_secs(30))),

Currently, the Linkerd proxy does not expose this setting, and there’s no way to override it via annotations or configuration.

How can it be reproduced?

Establish an HTTP/1.1 keep-alive connection (e.g., using Python’s requests.Session) and leave it idle without sending further requests for over 30 seconds. The Linkerd proxy will close the connection with a HeaderTimeout error.
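
The timing mismatch can be reproduced without Linkerd at all. The following self-contained sketch (plain sockets, not hyper or the Linkerd proxy) mimics the behavior described above: a server that, like hyper 1.x's default h1_header_read_timeout, closes a keep-alive connection when no new request headers arrive in time. The timeout is shortened to 0.5s so the demo runs quickly; the real defaults are a 30s proxy timeout against a 60s scrape interval.

```python
import socket
import threading
import time

# Shortened stand-in for hyper's 30-second h1_header_read_timeout.
HEADER_READ_TIMEOUT = 0.5

def serve(listener, errors):
    conn, _ = listener.accept()
    conn.settimeout(HEADER_READ_TIMEOUT)   # arm the "header read" timer
    try:
        while conn.recv(4096):             # wait for the next request's headers
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
                         b"Connection: keep-alive\r\n\r\nok")
    except socket.timeout:                 # analogue of hyper's HeaderTimeout
        errors.append("read header from client timeout")
    finally:
        conn.close()

listener = socket.create_server(("127.0.0.1", 0))
errors = []
server = threading.Thread(target=serve, args=(listener, errors))
server.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET /metrics HTTP/1.1\r\nHost: example\r\n\r\n")
client.recv(4096)      # first scrape succeeds
time.sleep(1.0)        # stay idle longer than the timeout, like a 60s interval
server.join()          # server has timed out and closed the connection
print(errors[0])       # -> read header from client timeout
```

The client did nothing wrong here: it simply reused a keep-alive connection at an interval longer than the server-side header read timeout, which is exactly the situation vmagent hits behind the proxy.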

Logs, error output, etc

[450882.624436s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=read header from client timeout client.addr=10.246.13.59:50568 server.addr=10.246.12.16:9793
[450881.544872s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=read header from client timeout client.addr=10.246.13.59:47154 server.addr=10.246.15.207:9180
[450878.959567s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=read header from client timeout client.addr=10.246.13.59:44612 server.addr=10.246.14.56:9793
[450878.672225s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=read header from client timeout client.addr=10.246.13.59:44486 server.addr=10.246.9.45:9180

output of linkerd check -o short

linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.5.5 but the latest edge version is 25.6.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.5.5 but the latest edge version is 25.6.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-97c46bc4d-mst8p (edge-25.5.5)
	* linkerd-destination-97c46bc4d-r4pdm (edge-25.5.5)
	* linkerd-destination-97c46bc4d-vsfhf (edge-25.5.5)
	* linkerd-identity-7b5c4b4f75-gvgw5 (edge-25.5.5)
	* linkerd-identity-7b5c4b4f75-l6hln (edge-25.5.5)
	* linkerd-identity-7b5c4b4f75-wsj9m (edge-25.5.5)
	* linkerd-proxy-injector-c58469d8f-257lh (edge-25.5.5)
	* linkerd-proxy-injector-c58469d8f-8f4v2 (edge-25.5.5)
	* linkerd-proxy-injector-c58469d8f-tcqv4 (edge-25.5.5)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

Status check results are √

Environment

  • Kubernetes Version: 1.32
  • Cluster Environment: RKE2
  • Host OS: Ubuntu
  • Linkerd Version: 25.5.5

Possible solution

Expose the h1_header_read_timeout setting in the Linkerd proxy configuration, possibly via annotations or config fields, so that users can adjust the HTTP/1 idle timeout behavior to better suit long-polling or low-frequency scraping use cases.

Additional context

No response

Would you like to work on fixing this bug?

maybe

gamerslouis avatar Jun 17 '25 13:06 gamerslouis

We're hitting this issue as well with Prometheus scrapers that have 1m scrape intervals (and other similar requests).

We've tried every related configuration Linkerd exposes, including timeout.linkerd.io/idle, and none of them helped.

Thanks.

travisjeffery avatar Jun 17 '25 13:06 travisjeffery

We're spotting the same issue on edge-25.5.5 between our applications and our self-hosted Elastic APM server.

danihengeveld avatar Jun 24 '25 12:06 danihengeveld

thank you all for filing this issue, and confirming the veracity of this report.

we were able to identify a change in the default behavior of hyper, our HTTP library, as being responsible for this. hyper's 1.0 release introduced a 30-second header read timeout, which in effect acts as an idle timeout for keep-alive http/1.1 connections, while the 0.14 release used prior to v2.288 did not enforce such a timeout.

pull request linkerd/linkerd2-proxy#3985 removed this timeout, restoring the behavior that was present prior to the v2.288 release of the proxy. this issue should now be fixed in proxy release v2.304, which is included in the edge-25.6.4 release of linkerd2, see: https://github.com/linkerd/linkerd2/releases/tag/edge-25.6.4

i'll refrain from closing this issue for now, to leave space for others to consider making this timeout configurable, but the connection errors logged in this report should no longer occur.

cratelyn avatar Jun 26 '25 20:06 cratelyn

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 25 '25 03:09 stale[bot]