destination and proxy-injector proxies can't get certified by identity after node startup

Open perrness opened this issue 3 years ago • 2 comments

What is the issue?

When a new node starts up, the linkerd-destination and linkerd-proxy-injector pods get stuck in a crash loop because their linkerd-proxy container is in a failed state. It seems they can't get certified by linkerd-identity. Redeploying linkerd-identity manually resolves the issue, after which everything works fine.

We run in AKS, deployed with Terraform and a Helm install. The problem appears when a new node starts.

How can it be reproduced?

We are running in AKS with automatic node image upgrades: the node is drained and a new node starts up. I was also able to reproduce it on my own personal test cluster just by stopping and starting the cluster. You can see my setup here.

Logs, error output, etc

From linkerd-destination proxy container

{"timestamp":"[   149.882911s]","level":"DEBUG","fields":{"message":"Connecting","server.addr":"10.52.0.63:8080"},"target":"linkerd_proxy_transport::connect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}                          
{"timestamp":"[   149.883213s]","level":"DEBUG","fields":{"message":"Connected","local.addr":"10.52.0.17:38572","keepalive":"Some(10s)"},"target":"linkerd_proxy_transport::connect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"},{"name":"h2"}],"threadId":"ThreadId(2)"}                                                                                                                                                
{"timestamp":"[   149.883655s]","level":"WARN","fields":{"message":"Failed to connect","error":"received corrupt message"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}                                
{"timestamp":"[   149.883681s]","level":"DEBUG","fields":{"message":"Recovering"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}                                                                         
{"timestamp":"[   149.883689s]","level":"DEBUG","fields":{"message":"Disconnected","backoff":"true"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"server.addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}                                                      
{"timestamp":"[   150.013569s]","level":"WARN","fields":{"message":"Waiting for identity to be initialized..."},"target":"linkerd_app","threadId":"ThreadId(1)"} 

From linkerd-proxy-injector proxy container

{"timestamp":"[   149.864874s]","level":"WARN","fields":{"message":"Failed to connect","error":"received corrupt message"},"target":"linkerd_reconnect","spans":[{"name":"identity"},{"addr":"linkerd-identity-headless.linkerd.svc.cluster.local:8080","name":"controller"},{"addr":"10.52.0.63:8080","name":"endpoint"}],"threadId":"ThreadId(2)"}
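
For anyone triaging the same symptom, the failing lines above can be picked out of a saved proxy log with a small filter. This is only a minimal sketch: it assumes the proxy is logging JSON in the shape shown in this issue, and the function name `identity_connect_failures` is hypothetical, not part of Linkerd.

```python
import json


def identity_connect_failures(lines):
    """Yield (timestamp, endpoint address, error) for failed identity connections.

    Assumes one JSON log entry per line, shaped like the proxy logs in this issue.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        fields = entry.get("fields", {})
        if fields.get("message") != "Failed to connect":
            continue
        spans = entry.get("spans", [])
        # Only keep failures that happened inside the "identity" span
        if not any(span.get("name") == "identity" for span in spans):
            continue
        # The "endpoint" span carries the address the proxy actually dialed
        endpoint = next(
            (span.get("addr") for span in spans if span.get("name") == "endpoint"),
            None,
        )
        yield entry.get("timestamp"), endpoint, fields.get("error")
```

Fed the linkerd-proxy-injector line above, it would report the dialed endpoint 10.52.0.63:8080 together with the "received corrupt message" error, which makes it easy to compare against the identity pod's address.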

From linkerd-identity

{"level":"info","msg":"running version stable-2.11.4","time":"2022-07-24T11:54:31Z"}
{"level":"info","msg":"starting admin server on :9990","time":"2022-07-24T11:54:31Z"}
{"level":"info","msg":"starting gRPC server on :8080","time":"2022-07-24T11:54:31Z"}
{"level":"info","msg":"issued certificate for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2022-07-25 11:56:14 +0000 UTC: 5ba00f0d3e1b2f5fd47ab1780601e38a","time":"2022-07-24T11:55:54Z"}   

From linkerd-identity proxy container

{"timestamp":"[     0.000787s]","level":"INFO","fields":{"message":"Using single-threaded proxy runtime"},"target":"linkerd2_proxy::rt","threadId":"ThreadId(1)"}                                                                                                                                                     
{"timestamp":"[     0.002093s]","level":"INFO","fields":{"message":"Admin interface on 0.0.0.0:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}  
{"timestamp":"[     0.002105s]","level":"INFO","fields":{"message":"Inbound interface on 0.0.0.0:4143"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[     0.002110s]","level":"INFO","fields":{"message":"Outbound interface on 127.0.0.1:4140"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}                                                                                                                                                        
{"timestamp":"[     0.002114s]","level":"INFO","fields":{"message":"Tap DISABLED"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}                     
{"timestamp":"[     0.002117s]","level":"INFO","fields":{"message":"Local identity is linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}                                                                                                    
{"timestamp":"[     0.002125s]","level":"INFO","fields":{"message":"Identity verified via localhost:8080"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}                                                                                                                                                        
{"timestamp":"[     0.002129s]","level":"INFO","fields":{"message":"Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}                                   
{"timestamp":"[     0.004149s]","level":"WARN","fields":{"message":"Failed to resolve control-plane component","error":"failed SRV and A record lookups: failed to resolve SRV record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN; failed to resolve A record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: AAAA class: IN"},"target":"linkerd_app_core::control","spans":[{"name":"policy"},{"port":"8080","name":"watch"},{"addr":"linkerd-policy.linkerd.svc.cluster.local:8090","name":"controller"}],"threadId":"ThreadId(1)"}                               
{"timestamp":"[     0.020429s]","level":"INFO","fields":{"message":"Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"},"target":"linkerd_app","spans":[{"name":"daemon"},{"name":"identity"}],"threadId":"ThreadId(2)"}                                                     
{"timestamp":"[     0.113567s]","level":"WARN","fields":{"message":"Failed to resolve control-plane component","error":"failed SRV and A record lookups: failed to resolve SRV record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN; failed to resolve A record: no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: AAAA class: IN"},"target":"linkerd_app_core::control","spans":[{"name":"policy"},{"port":"8080","name":"watch"},{"addr":"linkerd-policy.linkerd.svc.cluster.local:8090","name":"controller"}],"threadId":"ThreadId(1)"}

Endpoints

NAME                        ENDPOINTS         AGE
linkerd-dst                                   17h
linkerd-dst-headless                          17h
linkerd-identity            10.52.0.63:8080   17h
linkerd-identity-headless   10.52.0.63:8080   17h
linkerd-policy                                17h
linkerd-policy-validator                      17h
linkerd-proxy-injector                        17h
linkerd-sp-validator                          17h
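
In the listing above, only the two identity Services have endpoints; every other control-plane Service is empty. A rough sketch for pulling that out of such a listing follows. It assumes whitespace-separated NAME / ENDPOINTS / AGE columns as shown here (an empty cell either left blank or printed as `<none>`), and the helper name `services_without_endpoints` is made up for illustration.

```python
def services_without_endpoints(table_text):
    """Return the names of rows whose ENDPOINTS column is empty.

    Assumes a NAME / ENDPOINTS / AGE listing like the one in this issue.
    """
    rows = table_text.strip().splitlines()
    missing = []
    for row in rows[1:]:  # rows[0] is the header line
        cols = row.split()
        # A blank ENDPOINTS cell leaves only NAME and AGE after splitting
        blank_cell = len(cols) == 2
        # Some listings print an explicit placeholder instead of a blank
        none_cell = len(cols) == 3 and cols[1] == "<none>"
        if blank_cell or none_cell:
            missing.append(cols[0])
    return missing
```

Run against the listing above, this should name linkerd-dst, linkerd-dst-headless, linkerd-policy, linkerd-policy-validator, linkerd-proxy-injector and linkerd-sp-validator, consistent with the destination and proxy-injector pods never becoming ready.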

Output of linkerd check -o short

linkerd check -o short
Linkerd core checks
===================

linkerd-existence
-----------------
× control plane pods are ready
    pod/linkerd-destination-6dbdd44646-bdnk8 container linkerd-proxy is not ready
    see https://linkerd.io/2.11/checks/#l5d-api-control-ready for hints

Status check results are ×

Environment

Kubernetes version - 1.23.5
AKS Cluster
Host OS - linux
Linkerd version - helm chart 2.11.4
Azure CNI
Linkerd-CNI enabled

Possible solution

No response

Additional context

Might be the same as #8496 and the related Slack thread.

Would you like to work on fixing this bug?

yes

perrness, Jul 24 '22 12:07

"error":"received corrupt message" makes it sound like the identity controller is starting up without the CNI being run. That is, connections from clients aren't hitting the identity controller's proxy, they are hitting the identity controller directly (and so TLS isn't being terminated).

There's an existing issue that tracks this, along with work in progress to detect and remediate the situation: https://github.com/linkerd/linkerd2/issues/8120

olix0r, Jul 26 '22 15:07

I see, thanks for the explanation! Will wait for #8120 to see if it resolves it.

perrness, Jul 27 '22 09:07

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot], Oct 27 '22 17:10