
Linkerd Destination Pods Readiness and Liveness Probes failures

jaswanth9522 opened this issue 9 months ago

What is the issue?

The Linkerd destination pods continuously emit Unhealthy warnings, and I have seen this across multiple releases: the readiness and liveness probes fail repeatedly. I raised this in the past, thinking the probe timeoutSeconds was too low, and I have since increased it to 10 seconds for all of the probes, but the issue persists. I expected the failures to stop at some point, but they have not.
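
For reference, a version-independent way to raise these timeouts while debugging is to patch the Deployment directly. This is only a sketch: it assumes the default container name `destination`, and a subsequent `helm upgrade` will revert the patch.

```sh
# Raise probe timeouts on the destination container (strategic-merge patch,
# matched by container name; adjust the name if your install differs).
kubectl -n linkerd patch deployment linkerd-destination --patch '
spec:
  template:
    spec:
      containers:
        - name: destination
          livenessProbe:
            timeoutSeconds: 10
          readinessProbe:
            timeoutSeconds: 10
'
```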

How can it be reproduced?

Install the edge-25.2.1 version of the Helm charts and monitor the destination pods for some time; a reproduction sketch follows.
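
A minimal reproduction sketch, assuming the standard linkerd-edge Helm repo and pre-generated mTLS certificates (the `--version` value is a placeholder for the chart version matching edge-25.2.1):

```sh
helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo update
helm install linkerd-crds linkerd-edge/linkerd-crds -n linkerd --create-namespace
helm install linkerd-control-plane linkerd-edge/linkerd-control-plane -n linkerd \
  --version <chart version for edge-25.2.1> \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key

# Watch the destination pods and any probe-failure events:
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=destination --watch
kubectl -n linkerd get events --field-selector reason=Unhealthy --watch
```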

Logs, error output, etc

linkerd destination container logs:

time="2025-03-22T13:02:11Z" level=info msg="running version edge-25.2.1" time="2025-03-22T13:02:11Z" level=info msg="starting admin server on :9996" time="2025-03-22T13:02:11Z" level=info msg="Using default opaque ports: map[25:{} 587:{} 3306:{} 4444:{} 5432:{} 6379:{} 9300:{} 11211:{}]" time="2025-03-22T13:02:11Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"job_cache_size\", help: \"Number of items in the client-go job cache\", constLabels: {cluster=\"local\"}, variableLabels: {}}: duplicate metrics collector registration attempted" time="2025-03-22T13:02:11Z" level=info msg="waiting for caches to sync" time="2025-03-22T13:02:11Z" level=info msg="caches synced" time="2025-03-22T13:02:11Z" level=info msg="waiting for caches to sync" time="2025-03-22T13:02:11Z" level=info msg="caches synced" time="2025-03-22T13:02:11Z" level=info msg="waiting for caches to sync" time="2025-03-22T13:02:11Z" level=info msg="caches synced" time="2025-03-22T13:02:11Z" level=info msg="starting gRPC server on :8086" time="2025-03-22T13:02:11Z" level=info msg="attempting to acquire leader lease linkerd/linkerd-destination-endpoint-write..."

Linkerd proxy container logs:

[ 0.001757s] INFO ThreadId(01) linkerd2_proxy: release 2.280.0 (b2e8623) by linkerd on 2025-02-12T15:16:03Z
[ 0.004731s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.005998s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.006025s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.006029s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.006032s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.006035s] INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.006039s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.006041s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[ 0.006247s] INFO ThreadId(01) dst:controller{addr=localhost:8086}: linkerd_pool_p2c: Adding endpoint addr=127.0.0.1:8086
[ 0.006500s] INFO ThreadId(01) policy:controller{addr=localhost:8090}: linkerd_pool_p2c: Adding endpoint addr=127.0.0.1:8090
[ 0.006812s] WARN ThreadId(01) dst:controller{addr=localhost:8086}:endpoint{addr=127.0.0.1:8086}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.006843s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.008967s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.xy.xy.240:8080
[ 0.008994s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.xy.xy.81:8080
[ 0.009001s] INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.xy.xy.76:8080
[ 0.015464s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.113257s] WARN ThreadId(01) dst:controller{addr=localhost:8086}:endpoint{addr=127.0.0.1:8086}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.115566s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.323055s] WARN ThreadId(01) dst:controller{addr=localhost:8086}:endpoint{addr=127.0.0.1:8086}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.328390s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.730799s] WARN ThreadId(01) policy:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 10918.457164s] WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]

policy container logs:

2025-03-22T20:41:49.075616Z INFO status_controller: linkerd_policy_controller_k8s_status::index: Status controller leadership change leader=false
2025-03-22T20:44:21.753042Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:44:31.753560Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:48:01.754612Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:48:11.753781Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:49:11.754677Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:49:51.753360Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:50:21.753792Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:50:51.753769Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:53:01.754054Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:56:01.753195Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:56:21.753575Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T20:58:31.753550Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:03:21.752803Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:06:01.752791Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:08:51.753898Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:09:41.752881Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:12:01.754568Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:12:11.754222Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:13:41.754682Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:18:01.753576Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:18:41.754531Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:20:21.753428Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:21:11.754590Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:37:01.754613Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:38:11.753775Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:39:11.754669Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:42:11.753430Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:42:41.754479Z WARN hyper::proto::h1::io: read header from client timeout
2025-03-22T21:43:11.754474Z WARN hyper::proto::h1::io: read header from client timeout
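
hyper emits `read header from client timeout` when a client opens a connection but sends no request bytes before the server's read timeout, so these warnings may come from TCP-level health checks or idle connections rather than from the probe HTTP requests themselves. The policy container's probe endpoints can be timed the same way as the destination container's; a sketch, assuming the default policy admin port 9990 with `/live` and `/ready` paths:

```sh
# Forward the policy controller's admin port and time its probe endpoints.
kubectl -n linkerd port-forward deploy/linkerd-destination 9990:9990 &
curl -sS -o /dev/null -w 'live:  %{http_code} in %{time_total}s\n' http://localhost:9990/live
curl -sS -o /dev/null -w 'ready: %{http_code} in %{time_total}s\n' http://localhost:9990/ready
kill %1  # stop the port-forward
```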

output of linkerd check -o short

Linkerd check output:

$ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2025-03-24T12:46:07Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 25.2.2 but the latest edge version is 25.3.3
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 25.2.1 but the latest edge version is 25.3.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-25.2.1 but cli running edge-25.2.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-788574cf47-54kmc (edge-25.2.1)
        * linkerd-destination-788574cf47-djgln (edge-25.2.1)
        * linkerd-destination-788574cf47-fvr27 (edge-25.2.1)
        * linkerd-identity-6d9b469976-79n5r (edge-25.2.1)
        * linkerd-identity-6d9b469976-8kmz8 (edge-25.2.1)
        * linkerd-identity-6d9b469976-qjkbf (edge-25.2.1)
        * linkerd-proxy-injector-c46cd9cf5-5gvrq (edge-25.2.1)
        * linkerd-proxy-injector-c46cd9cf5-85ph4 (edge-25.2.1)
        * linkerd-proxy-injector-c46cd9cf5-xd7t2 (edge-25.2.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-788574cf47-54kmc running edge-25.2.1 but cli running edge-25.2.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
√ multiple replicas of control plane pods

linkerd-extension-checks
------------------------
√ namespace configuration for extensions

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
    could not find proxy container for metrics-api-5cfcb4dd46-xn22l pod
    see https://linkerd.io/2/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
    container "linkerd-proxy" in pod "metrics-api-5cfcb4dd46-xn22l" is not ready
    see https://linkerd.io/2/checks/#l5d-viz-pods-running for hints
‼ viz extension proxies are healthy
    no "linkerd-proxy" containers found in the "linkerd" namespace
    see https://linkerd.io/2/checks/#l5d-viz-proxy-healthy for hints
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ viz extension self-check

linkerd-smi
-----------
‼ Linkerd extension command linkerd-smi exists
    exec: "linkerd-smi": executable file not found in $PATH
    see https://linkerd.io/2/checks/#extensions for hints


Environment

Kubernetes Version: v1.28.15+rke2r1
Cluster Env: Rancher RKE2
Host OS: Oracle Linux Server 8.9
Linkerd Version: edge-25.2.1

Possible solution

No response

Additional context

(screenshot attached)

Would you like to work on fixing this bug?

no

jaswanth9522 (Mar 22 '25)

Running into this on enterprise-2.17.0 as well

{"timestamp":"2025-04-04T13:50:20.779634Z","level":"WARN","fields":{"message":"read header from client timeout"},"target":"hyper::proto::h1::io"}                                  
{"timestamp":"2025-04-04T13:50:29.513913Z","level":"WARN","fields":{"message":"read header from client timeout"},"target":"hyper::proto::h1::io"}                                  
{"timestamp":"2025-04-04T13:50:40.340148Z","level":"WARN","fields":{"message":"read header from client timeout"},"target":"hyper::proto::h1::io"}                                  

FredrikAugust (Apr 04 '25)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] (Jul 03 '25)