
Service failures for a minute during Linkerd version upgrade

Open parkjeongryul opened this issue 7 months ago • 1 comment

What is the issue?

Problem Description

During Linkerd version upgrades, approximately 10% of requests fail for about one minute, resulting in service disruption while the upgrade is in progress.

Question

Is there a way to perform zero-downtime Linkerd version upgrades? We're looking for best practices to avoid service interruption during control plane updates.


Expected Outcome

We are seeking guidance on upgrading Linkerd without service disruption, or on alternative approaches that achieve zero downtime during version upgrades.

How can it be reproduced?

  • Upgrade the Helm chart using ArgoCD (roughly equivalent to the Helm commands sketched below)
    • From: https://helm.linkerd.io/stable 1.12.5
    • To: https://helm.linkerd.io/edge 2025.5.1
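
For reference, the ArgoCD sync above is roughly equivalent to the following Helm commands. This is only a sketch: the repo aliases and the --reuse-values flag are illustrative assumptions, since ArgoCD manages chart values itself.

# Sketch of the upgrade ArgoCD performs; repo aliases are illustrative.
helm repo add linkerd https://helm.linkerd.io/stable
helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo update

# Jump straight from the stable chart (1.12.5) to an edge chart --
# the version skip that triggers the failures in the logs below.
helm upgrade linkerd-crds linkerd-edge/linkerd-crds \
  --namespace linkerd --version 2025.5.1
helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane \
  --namespace linkerd --version 2025.5.1 --reuse-values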

Logs, error output, etc

Please see the proxy log below showing the connection failures that occur during the upgrade process.

Defaulted container "linkerd-proxy" out of: linkerd-proxy, httpd, log-access, log-core, exporter, sc-exporter, init-service-directory (init), render-http-conf (init), linguist2-package-download (init), linkerd-init (init)
[     0.001728s]  INFO ThreadId(01) linkerd2_proxy::rt: Using multi-threaded proxy runtime cores=4
[     0.002565s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.002571s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.002572s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.002575s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.002577s]  INFO ThreadId(01) linkerd2_proxy: Local identity is default.clous-jrpark-2502261501.serviceaccount.identity.linkerd.cluster.local
[     0.002579s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.002580s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.009399s]  INFO ThreadId(06) daemon:identity: linkerd_app: Certified identity id=default.clous-jrpark-2502261501.serviceaccount.identity.linkerd.cluster.local
[   104.693433s]  WARN ThreadId(04) outbound:proxy{addr=10.160.90.7:80}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.102.175:8090}: linkerd_reconnect: Service failed error=endpoint 172.24.102.175:8090: channel closed error.sources=[channel closed]
[   125.366861s]  WARN ThreadId(02) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Service failed error=endpoint 172.24.4.136:8090: channel closed error.sources=[channel closed]
[   125.527301s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Service failed error=endpoint 172.24.4.136:8086: channel closed error.sources=[channel closed]
[   125.663275s]  WARN ThreadId(03) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   125.877146s]  WARN ThreadId(03) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Service failed error=endpoint 172.24.89.9:8090: channel closed error.sources=[channel closed]
[   125.885136s]  WARN ThreadId(05) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   125.983845s]  WARN ThreadId(04) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.084983s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Service failed error=endpoint 172.24.89.9:8086: channel closed error.sources=[channel closed]
[   126.085002s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.191964s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.194202s]  WARN ThreadId(03) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.292891s]  WARN ThreadId(04) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.303854s]  WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.403166s]  WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.596980s]  WARN ThreadId(05) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.89.9:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.709651s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.795145s]  WARN ThreadId(04) watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=172.24.4.136:8090}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   126.837020s]  WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   127.210600s]  WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   127.338940s]  WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   127.711545s]  WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   127.840790s]  WARN ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   128.212686s]  WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   128.713774s]  WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:service{ns= name=service port=0}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.4.136:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.4.136:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[   129.086052s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_stack::failfast: Service entering failfast after 3s
[   129.343616s]  WARN ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=172.24.89.9:8086}: linkerd_reconnect: Failed to connect error=endpoint 172.24.89.9:8086: connect timed out after 1s error.sources=[connect timed out after 1s]
[   129.562397s]  INFO ThreadId(02) inbound: linkerd_app_core::serve: Connection closed error=connection closed before message completed client.addr=10.160.227.228:50850
[   130.042473s]  INFO ThreadId(02) linkerd_stack::failfast: Service has recovered
[   130.113149s]  WARN ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}: linkerd_stack::failfast: Service entering failfast after 3s
[   130.113232s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53856}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113229s]  INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53852}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113237s]  INFO ThreadId(05) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53860}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113237s]  INFO ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53862}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113361s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53858}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113364s]  INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53866}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113435s]  INFO ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53864}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service in fail-fast error.sources=[service in fail-fast]
[   130.113988s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53858}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   134.846948s]  INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53854}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.352985s]  INFO ThreadId(03) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53896}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.356039s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53904}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.356247s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53904}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.435651s]  INFO ThreadId(04) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53892}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.462971s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53906}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.516011s]  INFO ThreadId(05) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53902}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.814020s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53898}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   135.814296s]  INFO ThreadId(02) outbound:proxy{addr=172.23.205.50:10002}:rescue{client.addr=172.24.107.26:53898}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 172.23.205.50:10002: service unavailable error.sources=[service unavailable]
[   136.244542s]  INFO ThreadId(03) linkerd_stack::failfast: Service has recovered

output of linkerd check -o short

Before update

linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2025-05-13T16:13:25Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.5.1 but the latest edge version is 25.5.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.13.5
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.13.5 but cli running edge-25.5.1
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-58ddfc7dd6-5v6vt (stable-2.13.5)
	* linkerd-destination-58ddfc7dd6-97bkl (stable-2.13.5)
	* linkerd-destination-58ddfc7dd6-jhrsz (stable-2.13.5)
	* linkerd-identity-6fb7fbf94f-4xq5x (stable-2.13.5)
	* linkerd-identity-6fb7fbf94f-8k7mm (stable-2.13.5)
	* linkerd-identity-6fb7fbf94f-nfqnh (stable-2.13.5)
	* linkerd-proxy-injector-766d6d4f6f-bp2cs (stable-2.13.5)
	* linkerd-proxy-injector-766d6d4f6f-cqmzh (stable-2.13.5)
	* linkerd-proxy-injector-766d6d4f6f-z7h7w (stable-2.13.5)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-58ddfc7dd6-5v6vt running stable-2.13.5 but cli running edge-25.5.1
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Status check results are √

After update

linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2025-05-13T16:13:25Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.5.1 but the latest edge version is 25.5.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.5.1 but the latest edge version is 25.5.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-59ccc7ffcf-2bwbw (edge-25.5.1)
	* linkerd-destination-59ccc7ffcf-cc97n (edge-25.5.1)
	* linkerd-destination-59ccc7ffcf-p5z7p (edge-25.5.1)
	* linkerd-identity-58786846cb-2545s (edge-25.5.1)
	* linkerd-identity-58786846cb-74stx (edge-25.5.1)
	* linkerd-identity-58786846cb-m87q8 (edge-25.5.1)
	* linkerd-proxy-injector-6c5c9cb847-pf9jc (edge-25.5.1)
	* linkerd-proxy-injector-6c5c9cb847-qbqhx (edge-25.5.1)
	* linkerd-proxy-injector-6c5c9cb847-r4l2g (edge-25.5.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

Status check results are √

Environment

  • Kubernetes version: v1.23.15
  • Hosted k8s cluster

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

parkjeongryul · May 12 '25 10:05

Thanks for reporting this @parkjeongryul. As far as I can see, you are upgrading from stable-2.13.5 to one of our latest edge releases. This upgrade path is not supported, so there are no guarantees around zero-downtime upgrades.

You can refer to our docs to get more clarity on supported upgrade paths: https://linkerd.io/2-edge/tasks/upgrade/
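
As a rough sketch of the supported approach, assuming Helm drives the upgrade: step through intermediate releases one at a time instead of jumping from stable-2.13.5 straight to an edge release. <next-chart-version> is a placeholder; look up the actual intermediate chart versions first and confirm the path against the docs above.

# List available chart versions to plan the intermediate steps.
helm search repo linkerd/linkerd-control-plane --versions

# Repeat per intermediate release, verifying health before the next step.
helm upgrade linkerd-crds linkerd/linkerd-crds \
  --namespace linkerd --version <next-chart-version>
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  --namespace linkerd --version <next-chart-version>
linkerd check

# Once the control plane is healthy, restart meshed workloads so their
# data-plane proxies pick up the matching version (a rolling restart):
kubectl rollout restart deployment -n <your-app-namespace>

Upgrading one release at a time keeps old and new control-plane components compatible with the proxies that are still running, which is what the zero-downtime expectation assumes.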

zaharidichev · May 16 '25 16:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Aug 14 '25 23:08