destination and proxy-injector proxies can't get certified by identity after node startup
What is the issue?
When a new node starts up, the linkerd-destination and linkerd-proxy-injector pods get stuck in a crash loop because their linkerd-proxy container is in a failed state. It seems they can't get certified by linkerd-identity. Manually redeploying linkerd-identity resolves the issue, after which everything works fine.
This happens in AKS, with Linkerd installed via Terraform and Helm, whenever a new node starts.
How can it be reproduced?
We are running in AKS with automatic node image upgrades: a node is drained and a new node starts up. I was also able to reproduce it on my own personal test cluster just by stopping and starting the cluster. You can see my setup here.
Logs, error output, etc
From linkerd-destination proxy container
{"timestamp":"[ 149.882911s]","level":"DEBUG","fields":{"message":"Connecting","server.addr":"10.52.0.63:8080"},"target":"linkerd_proxy_transport::connect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 149.883213s]","level":"DEBUG","fields":{"message":"Connected","local.addr":"10.52.0.17:38572","keepalive":"Some(10s)"},"target":"linkerd_proxy_transport::connect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"},{"name":"h2"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 149.883655s]","level":"WARN","fields":{"message":"Failed to connect","error":"received corrupt message"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 149.883681s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 149.883689s]","level":"DEBUG","fields":{"message":"Disconnected","backoff":"true"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 150.013569s]","level":"WARN","fields":{"message":"Waiting for identity to be initialized..."},"target":"linkerd_app","threadId":"ThreadId(1)"}
From linkerd-proxy-injector proxy container
{"timestamp":"[ 149.864874s]","level":"WARN","fields":{"message":"Failed to connect","error":"received corrupt message"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
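For anyone triaging similar proxy logs, the failing attempts can be pulled out with a few lines of Python. This is only a sketch: it assumes the JSON log layout shown above (a `level` key, an `error` inside `fields`, and a span named `endpoint` carrying the `addr`), and the function name `connection_failures` is made up for illustration.

```python
import json

# Two lines in the same shape as the proxy logs above (spans shortened).
SAMPLE = r'''
{"timestamp":"[ 149.864874s]","level":"WARN","fields":{"message":"Failed to connect","error":"received corrupt message"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 150.013569s]","level":"WARN","fields":{"message":"Waiting for identity to be initialized..."},"target":"linkerd_app","threadId":"ThreadId(1)"}
'''


def connection_failures(log_text: str) -> list[tuple[str, str, str]]:
    """Return (timestamp, error, endpoint) for each WARN entry that carries an error."""
    failures = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and non-JSON output
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or interleaved lines
        fields = entry.get("fields", {})
        if entry.get("level") != "WARN" or "error" not in fields:
            continue
        # The span named "endpoint" holds the address the proxy dialed.
        endpoint = next(
            (s["addr"] for s in entry.get("spans", [])
             if s.get("name") == "endpoint" and "addr" in s),
            "unknown",
        )
        failures.append((entry.get("timestamp", ""), fields["error"], endpoint))
    return failures


if __name__ == "__main__":
    for timestamp, error, endpoint in connection_failures(SAMPLE):
        print(timestamp, endpoint, error)
```

Here every failure points at the same endpoint, 10.52.0.63:8080, which is the linkerd-identity pod.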
From linkerd-identity
{"level":"info","msg":"running version stable-2.11.4","time":"2022-07-24T11:54:31Z"}
{"level":"info","msg":"starting admin server on :9990","time":"2022-07-24T11:54:31Z"}
{"level":"info","msg":"starting gRPC server on :8080","time":"2022-07-24T11:54:31Z"}
{"level":"info","msg":"issued certificate for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2022-07-25 11:56:14 +0000 UTC: 5ba00f0d3e1b2f5fd47ab1780601e38a","time":"2022-07-24T11:55:54Z"}
From linkerd-identity proxy container
{"timestamp":"[ 0.000787s]","level":"INFO","fields":{"message":"Using single-threaded proxy runtime"},"target":"linkerd2_proxy::rt","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002093s]","level":"INFO","fields":{"message":"Admin interface on 0.0.0.0:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002105s]","level":"INFO","fields":{"message":"Inbound interface on 0.0.0.0:4143"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002110s]","level":"INFO","fields":{"message":"Outbound interface on 127.0.0.1:4140"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002114s]","level":"INFO","fields":{"message":"Tap DISABLED"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002117s]","level":"INFO","fields":{"message":"Local identity is linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002125s]","level":"INFO","fields":{"message":"Identity verified via localhost:8080"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.002129s]","level":"INFO","fields":{"message":"Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 0.004149s]","level":"WARN","fields":{"message":"Failed to resolve control-plane component","error":"failed SRV and A record lookups: failed to resolve SRV record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN; failed to resolve A record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: AAAA class: IN"},"target":"linkerd_app_core::control","spans":[{"name":"policy"},{"port":"8080","name":"watch"},{"addr":"linkerd-policy.linkerd.svc.cluster.local:8090","name":"controller"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 0.020429s]","level":"INFO","fields":{"message":"Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_app","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(2)"}
{"timestamp":"[ 0.113567s]","level":"WARN","fields":{"message":"Failed to resolve control-plane component","error":"failed SRV and A record lookups: failed to resolve SRV record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN; failed to resolve A record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: AAAA class: IN"},"target":"linkerd_app_core::control","spans":[{"name":"policy"},{"port":"8080","name":"watch"},{"addr":"linkerd-policy.linkerd.svc.cluster.local:8090","name":"controller"}],"threadId":"ThreadId(1)"}
NAME                        ENDPOINTS         AGE
linkerd-dst                                   17h
linkerd-dst-headless                          17h
linkerd-identity            10.52.0.63:8080   17h
linkerd-identity-headless   10.52.0.63:8080   17h
linkerd-policy                                17h
linkerd-policy-validator                      17h
linkerd-proxy-injector                        17h
linkerd-sp-validator                          17h
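A listing like the one above can be checked mechanically for services that have no ready endpoints. This is a rough sketch that assumes a whitespace-separated NAME / ENDPOINTS / AGE layout; the helper name `services_without_endpoints` is hypothetical, and it also treats a literal `<none>` in the ENDPOINTS column as empty.

```python
TABLE = """
NAME ENDPOINTS AGE
linkerd-dst 17h
linkerd-dst-headless 17h
linkerd-identity 10.52.0.63:8080 17h
linkerd-identity-headless 10.52.0.63:8080 17h
linkerd-policy 17h
linkerd-policy-validator 17h
linkerd-proxy-injector 17h
linkerd-sp-validator 17h
"""


def services_without_endpoints(table: str) -> list[str]:
    """Return the names of rows whose ENDPOINTS column is empty."""
    missing = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        # A populated row has three fields (name, endpoints, age); an empty
        # ENDPOINTS column leaves two fields or shows "<none>".
        if len(fields) < 3 or fields[1] == "<none>":
            missing.append(fields[0])
    return missing


if __name__ == "__main__":
    for name in services_without_endpoints(TABLE):
        print(name)
```

In this case only linkerd-identity and linkerd-identity-headless have an endpoint, which matches the destination and proxy-injector pods never becoming ready.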
output of linkerd check -o short
linkerd check -o short
Linkerd core checks
===================
linkerd-existence
-----------------
× control plane pods are ready
pod/linkerd-destination-6dbdd44646-bdnk8 container linkerd-proxy is not ready
see https://linkerd.io/2.11/checks/#l5d-api-control-ready for hints
Status check results are ×
Environment
Kubernetes version - 1.23.5
AKS Cluster
Host OS - linux
Linkerd version - helm chart 2.11.4
Azure CNI
Linkerd-CNI enabled
Possible solution
No response
Additional context
Might be the same as #8496. See also the related Slack thread.
Would you like to work on fixing this bug?
yes
"error":"received corrupt message" makes it sound like the identity controller is starting up without the CNI being run. That is, connections from clients aren't hitting the identity controller's proxy, they are hitting the identity controller directly (and so TLS isn't being terminated).
There's an existing issue that tracks this and some work in progress to detect/remediate this situation https://github.com/linkerd/linkerd2/issues/8120
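To illustrate the failure mode described above, here is a small, self-contained Python sketch. It is not Linkerd code: a plain TCP server stands in for an identity controller whose port is reached directly without a proxy in front, and Python's `ssl` module stands in for the proxy's TLS client. The client expects a TLS handshake, receives bytes that are not a TLS record, and gives up, which is the same class of failure as the "received corrupt message" warning.

```python
import socket
import ssl
import threading


def plaintext_server(listener: socket.socket) -> None:
    """Accept one connection and answer with bytes that are not a TLS record."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)  # consume the client's TLS ClientHello
        conn.sendall(b"HTTP/1.1 400 Bad Request\r\n\r\n")


def demo() -> str:
    """Attempt a TLS handshake against the plaintext server and report the outcome."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=plaintext_server, args=(listener,), daemon=True)
    server.start()

    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # certificate checks are irrelevant here:
    ctx.verify_mode = ssl.CERT_NONE  # the handshake fails before any certificate is seen
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as raw:
            with ctx.wrap_socket(raw):
                return "handshake succeeded"
    except OSError as err:  # ssl.SSLError is a subclass of OSError
        return f"handshake failed: {type(err).__name__}"
    finally:
        server.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    print(demo())
```

The exact exception and message depend on the local OpenSSL build, so the sketch only reports the exception type. The point is that the handshake cannot succeed while nothing on the server side terminates TLS.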
I see, thanks for the explanation! Will wait for #8120 to see if it resolves it.