Linkerd destination pods: readiness and liveness probe failures
What is the issue?
The Linkerd destination pods continuously emit Unhealthy warnings, and I have seen this across multiple releases: their readiness and liveness probes fail over and over. I raised this in the past and initially assumed the probe timeoutSeconds was too low, but even after increasing it to 10 seconds for all of the probes (roughly as sketched below) the issue persists. I expected the failures to stop at some point, but they have not.
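For reference, the timeout bump was applied as a Helm values override along these lines. This is a sketch only: the `destinationController.*Probe.timeoutSeconds` / `policyController.*Probe.timeoutSeconds` keys and the `linkerd-control-plane` release name are assumptions; verify them against `helm show values` for your chart version.

```sh
# Sketch: raise probe timeouts for the destination pod's containers to 10s.
# Value names below are assumed; check `helm show values linkerd-edge/linkerd-control-plane`.
helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane -n linkerd \
  --reuse-values \
  --set destinationController.livenessProbe.timeoutSeconds=10 \
  --set destinationController.readinessProbe.timeoutSeconds=10 \
  --set policyController.livenessProbe.timeoutSeconds=10 \
  --set policyController.readinessProbe.timeoutSeconds=10
```

The probes still fail after this change, so the timeout value by itself does not appear to be the root cause.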
How can it be reproduced?
Install the edge-25.2.1 version of the Helm charts and monitor the destination pods for some time.
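A minimal install sketch, assuming the standard edge Helm repo and pre-generated identity certificates (ca.crt, issuer.crt, issuer.key); the exact chart version corresponding to edge-25.2.1 can be found with `helm search repo linkerd-edge -l`:

```sh
# Add the Linkerd edge Helm repo and install the CRDs plus the control plane.
helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo update
helm install linkerd-crds linkerd-edge/linkerd-crds -n linkerd --create-namespace
helm install linkerd-control-plane linkerd-edge/linkerd-control-plane -n linkerd \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key

# Watch the destination pods for readiness flapping and probe-driven restarts.
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=destination -w
```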
Logs, error output, etc
linkerd destination container logs:
time="2025-03-22T13:02:11Z" level=info msg="running version edge-25.2.1" time="2025-03-22T13:02:11Z" level=info msg="starting admin server on :9996" time="2025-03-22T13:02:11Z" level=info msg="Using default opaque ports: map[25:{} 587:{} 3306:{} 4444:{} 5432:{} 6379:{} 9300:{} 11211:{}]" time="2025-03-22T13:02:11Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"job_cache_size\", help: \"Number of items in the client-go job cache\", constLabels: {cluster=\"local\"}, variableLabels: {}}: duplicate metrics collector registration attempted" time="2025-03-22T13:02:11Z" level=info msg="waiting for caches to sync" time="2025-03-22T13:02:11Z" level=info msg="caches synced" time="2025-03-22T13:02:11Z" level=info msg="waiting for caches to sync" time="2025-03-22T13:02:11Z" level=info msg="caches synced" time="2025-03-22T13:02:11Z" level=info msg="waiting for caches to sync" time="2025-03-22T13:02:11Z" level=info msg="caches synced" time="2025-03-22T13:02:11Z" level=info msg="starting gRPC server on :8086" time="2025-03-22T13:02:11Z" level=info msg="attempting to acquire leader lease linkerd/linkerd-destination-endpoint-write..."
Linkerd proxy container logs
[ 0.001757s] INFO ThreadId(01) linkerd2_proxy: release 2.280.0 (b2e8623) by linkerd on 2025-02-12T15:16:03Z
[ 0.004731s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.005998s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.006025s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.006029s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.006032s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.006035s] INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.006039s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.006041s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[ 0.006247s] INFO ThreadId(01) dst:controller{addr=localhost:8086}: linkerd_pool_p2c: Adding endpoint addr=127.0.0.1:8086
[ 0.006500s] INFO ThreadId(01) policy:controller{addr=localhost:8090}: linkerd_pool_p2c: Adding endpoint addr=127.0.0.1:8090
[ 0.006812s] WARN ThreadId(01) dst:controller{addr=localhost:8086}:endpoint{addr=127.0.0.1:8086}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.006843s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.008967s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.xy.xy.240:8080
[ 0.008994s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.xy.xy.81:8080
[ 0.009001s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.xy.xy.76:8080
[ 0.015464s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.113257s] WARN ThreadId(01) dst:controller{addr=localhost:8086}:endpoint{addr=127.0.0.1:8086}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.115566s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.323055s] WARN ThreadId(01) dst:controller{addr=localhost:8086}:endpoint{addr=127.0.0.1:8086}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.328390s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.730799s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 10918.457164s] WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]
policy container logs
2025-03-22T20:41:49.075616Z INFO status_controller: linkerd_policy_controller_k8s_status::index: Status controller leadership change leader=false
2025-03-22T20:44:21.753042Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:44:31.753560Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:48:01.754612Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:48:11.753781Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:49:11.754677Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:49:51.753360Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:50:21.753792Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:50:51.753769Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:53:01.754054Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:56:01.753195Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:56:21.753575Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:58:31.753550Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:03:21.752803Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:06:01.752791Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:08:51.753898Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:09:41.752881Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:12:01.754568Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:12:11.754222Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:13:41.754682Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:18:01.753576Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:18:41.754531Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:20:21.753428Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:21:11.754590Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:37:01.754613Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:38:11.753775Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:39:11.754669Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:42:11.753430Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:42:41.754479Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:43:11.754474Z WARN hyper::proto::h1::io: read header from client timeout
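For completeness, the kubelet-reported probe failures themselves can be pulled with standard kubectl commands (a sketch; the pod name below is one of the replicas listed in the check output that follows):

```sh
# List Unhealthy (probe failure) events in the linkerd namespace, newest last.
kubectl -n linkerd get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp

# Inspect one affected replica's probe configuration and recent events.
kubectl -n linkerd describe pod linkerd-destination-788574cf47-54kmc
```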
output of linkerd check -o short
Linkerd check output:
linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs
√ cluster networks contains all pods
√ cluster networks contains all services
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2025-03-24T12:46:07Z
see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 25.2.2 but the latest edge version is 25.3.3
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
is running version 25.2.1 but the latest edge version is 25.3.3
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-25.2.1 but cli running edge-25.2.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-788574cf47-54kmc (edge-25.2.1)
* linkerd-destination-788574cf47-djgln (edge-25.2.1)
* linkerd-destination-788574cf47-fvr27 (edge-25.2.1)
* linkerd-identity-6d9b469976-79n5r (edge-25.2.1)
* linkerd-identity-6d9b469976-8kmz8 (edge-25.2.1)
* linkerd-identity-6d9b469976-qjkbf (edge-25.2.1)
* linkerd-proxy-injector-c46cd9cf5-5gvrq (edge-25.2.1)
* linkerd-proxy-injector-c46cd9cf5-85ph4 (edge-25.2.1)
* linkerd-proxy-injector-c46cd9cf5-xd7t2 (edge-25.2.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-788574cf47-54kmc running edge-25.2.1 but cli running edge-25.2.2
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
linkerd-ha-checks
-----------------
√ multiple replicas of control plane pods
linkerd-extension-checks
------------------------
√ namespace configuration for extensions
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
could not find proxy container for metrics-api-5cfcb4dd46-xn22l pod
see https://linkerd.io/2/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
container "linkerd-proxy" in pod "metrics-api-5cfcb4dd46-xn22l" is not ready
see https://linkerd.io/2/checks/#l5d-viz-pods-running for hints
‼ viz extension proxies are healthy
no "linkerd-proxy" containers found in the "linkerd" namespace
see https://linkerd.io/2/checks/#l5d-viz-proxy-healthy for hints
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ viz extension self-check
linkerd-smi
-----------
‼ Linkerd extension command linkerd-smi exists
exec: "linkerd-smi": executable file not found in $PATH
see https://linkerd.io/2/checks/#extensions for hints
Environment
Kubernetes Version: v1.28.15+rke2r1
Cluster Env: Rancher rke2
Host OS: Oracle Linux Server 8.9
Linkerd Version: edge-25.2.1
Possible solution
No response
Additional context
Would you like to work on fixing this bug?
no
Running into this on enterprise-2.17.0 as well
{"timestamp":"2025-04-04T13:50:20.779634Z","level":"WARN","fields":{"message":"read header from client timeout"},"target":"hyper::proto::h1::io"}
{"timestamp":"2025-04-04T13:50:29.513913Z","level":"WARN","fields":{"message":"read header from client timeout"},"target":"hyper::proto::h1::io"}
{"timestamp":"2025-04-04T13:50:40.340148Z","level":"WARN","fields":{"message":"read header from client timeout"},"target":"hyper::proto::h1::io"}