Service failures for about a minute during Linkerd version upgrade
What is the issue?
Problem Description
During Linkerd version upgrades, approximately 10% of requests fail for about 1 minute. This results in service disruption during the upgrade process.
Question
Is there a way to perform zero-downtime Linkerd version upgrades? We're looking for best practices to avoid service interruption during control plane updates.
Expected Outcome
Seeking guidance on upgrading Linkerd without service disruption, or alternative approaches to zero downtime during version upgrades.
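For reference, commands like the following can be used to watch the control-plane rollout and the data plane while the upgrade is in progress (a minimal sketch; the deployment names assume the defaults installed by the linkerd-control-plane chart):
# Watch the control-plane components roll
kubectl rollout status deployment/linkerd-destination -n linkerd
kubectl rollout status deployment/linkerd-identity -n linkerd
kubectl rollout status deployment/linkerd-proxy-injector -n linkerd
# Then verify the data plane once the rollout completes
linkerd check --proxy
linkerd version --proxy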
How can it be reproduced?
- Upgrade the Helm chart via ArgoCD (roughly equivalent Helm commands are sketched below)
- From: https://helm.linkerd.io/stable, chart version 1.12.5 (stable-2.13.5)
- To: https://helm.linkerd.io/edge, chart version 2025.5.1 (edge-25.5.1)
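Roughly what the ArgoCD sync resolves to, expressed as plain Helm commands (a sketch only; the release name, namespace, and the separate linkerd-crds upgrade are assumptions based on the chart defaults):
# Stable and edge charts come from different Helm repositories
helm repo add linkerd https://helm.linkerd.io/stable
helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo update
# Before: stable chart 1.12.5 (stable-2.13.5); linkerd-crds is upgraded the same way first
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  --namespace linkerd --version 1.12.5 --reuse-values
# After: edge chart 2025.5.1 (edge-25.5.1)
helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane \
  --namespace linkerd --version 2025.5.1 --reuse-values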
Logs, error output, etc
Please find below the linkerd-proxy log showing the connection failures that occur during the upgrade process.
Defaulted container "linkerd-proxy" out of: linkerd-proxy, httpd, log-access, log-core, exporter, sc-exporter, init-service-directory (init), render-http-conf (init), linguist2-package-download (init), linkerd-init (init)
[ 0.001728s] INFO ThreadId(01) linkerd2_proxy::rt: Using multi-threaded proxy runtime cores=4
[ 0.002565s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.002571s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.002572s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.002575s] INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[ 0.002577s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.clous-jrpark-2502261501.serviceaccount.identity.linkerd.cluster.local
[ 0.002579s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.002580s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.009399s] INFO ThreadId(06) daemon:identity: linkerd_app: Certified identity id=default.clous-jrpark-2502261501.serviceaccount.identity.linkerd.cluster.local
[ 104.693433s] WARN ThreadId(04) outbound:proxy{addr=10.160.90.7:80}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.102.175:8090}: linkerd_reconnect: Service failed error=endpoint 172.24.102.175:8090: channel closed error.sources=[channel closed]
[ 125.366861s] WARN ThreadId(02) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Service failed error=endpoint 172.24.4.136:8090: channel closed error.sources=[channel closed]
[ 125.527301s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Service failed error=endpoint 172.24.4.136:8086: channel closed error.sources=[channel closed]
[ 125.663275s] WARN ThreadId(03) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 125.877146s] WARN ThreadId(03) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Service failed error=endpoint 172.24.89.9:8090: channel closed error.sources=[channel closed]
[ 125.885136s] WARN ThreadId(05) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 125.983845s] WARN ThreadId(04) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.084983s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Service failed error=endpoint 172.24.89.9:8086: channel closed error.sources=[channel closed]
[ 126.085002s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.191964s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.194202s] WARN ThreadId(03) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.292891s] WARN ThreadId(04) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.303854s] WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.403166s] WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.596980s] WARN ThreadId(05) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.709651s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.795145s] WARN ThreadId(04) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 126.837020s] WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 127.210600s] WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 127.338940s] WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 127.711545s] WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 127.840790s] WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 128.212686s] WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 128.713774s] WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 129.086052s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_stack::failfast: Service entering failfast after 3s
[ 129.343616s] WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: connect timed out after 1s error.sources=[connect timed out after 1s]
[ 129.562397s] INFO ThreadId(02) inbound: linkerd_app_core::serve: Connection closed error=connection closed before message completed client.addr=10.160.227.228:50850
[ 130.042473s] INFO ThreadId(02) linkerd_stack::failfast: Service has recovered
[ 130.113149s] WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}: linkerd_stack::failfast: Service entering failfast after 3s
[ 130.113232s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53856}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113229s] INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53852}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113237s] INFO ThreadId(05) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53860}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113237s] INFO ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53862}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113361s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53858}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113364s] INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53866}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113435s] INFO ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53864}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[ 130.113988s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53858}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 134.846948s] INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53854}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.352985s] INFO ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53896}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.356039s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53904}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.356247s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53904}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.435651s] INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53892}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.462971s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53906}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.516011s] INFO ThreadId(05) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53902}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.814020s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53898}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 135.814296s] INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53898}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[ 136.244542s] INFO ThreadId(03) linkerd_stack::failfast: Service has recovered
output of linkerd check -o short
Before update
linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2025-05-13T16:13:25Z
see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.5.1 but the latest edge version is 25.5.2
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
unsupported version channel: stable-2.13.5
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running stable-2.13.5 but cli running edge-25.5.1
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-58ddfc7dd6-5v6vt (stable-2.13.5)
* linkerd-destination-58ddfc7dd6-97bkl (stable-2.13.5)
* linkerd-destination-58ddfc7dd6-jhrsz (stable-2.13.5)
* linkerd-identity-6fb7fbf94f-4xq5x (stable-2.13.5)
* linkerd-identity-6fb7fbf94f-8k7mm (stable-2.13.5)
* linkerd-identity-6fb7fbf94f-nfqnh (stable-2.13.5)
* linkerd-proxy-injector-766d6d4f6f-bp2cs (stable-2.13.5)
* linkerd-proxy-injector-766d6d4f6f-cqmzh (stable-2.13.5)
* linkerd-proxy-injector-766d6d4f6f-z7h7w (stable-2.13.5)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-58ddfc7dd6-5v6vt running stable-2.13.5 but cli running edge-25.5.1
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
Status check results are √
After update
linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2025-05-13T16:13:25Z
see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.5.1 but the latest edge version is 25.5.2
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 25.5.1 but the latest edge version is 25.5.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-59ccc7ffcf-2bwbw (edge-25.5.1)
* linkerd-destination-59ccc7ffcf-cc97n (edge-25.5.1)
* linkerd-destination-59ccc7ffcf-p5z7p (edge-25.5.1)
* linkerd-identity-58786846cb-2545s (edge-25.5.1)
* linkerd-identity-58786846cb-74stx (edge-25.5.1)
* linkerd-identity-58786846cb-m87q8 (edge-25.5.1)
* linkerd-proxy-injector-6c5c9cb847-pf9jc (edge-25.5.1)
* linkerd-proxy-injector-6c5c9cb847-qbqhx (edge-25.5.1)
* linkerd-proxy-injector-6c5c9cb847-r4l2g (edge-25.5.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
Status check results are √
Environment
- Kubernetes version: v1.23.15
- Hosted k8s cluster
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
Thanks for reporting this @parkjeongryul. As I can see, you are upgrading from stable-2.13.5 to one of our latest edge releases. This upgrade path is not supported, so there are no guarantees around zero-downtime upgrades.
You can refer to our docs for more detail on supported upgrade paths: https://linkerd.io/2-edge/tasks/upgrade/
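For reference, a rough sketch of what an incremental, documented upgrade path looks like with Helm (illustrative only; the placeholder chart versions and release names are assumptions, and each step should follow the upgrade notes for that release):
helm repo update
# 1. Upgrade the CRDs first, then the control plane, one release at a time
helm upgrade linkerd-crds linkerd/linkerd-crds \
  -n linkerd --version <chart-version-for-next-release>
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  -n linkerd --version <chart-version-for-next-release>
# 2. Verify before moving to the next release
linkerd check
# 3. Restart meshed workloads so their proxies pick up the new version
kubectl rollout restart deployment -n <your-app-namespace>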
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.