After node restart linkerd-cni pod has to be restarted sometimes

Open msr-financial-com opened this issue 10 months ago • 3 comments

What is the issue?

When a node in a cluster where linkerd-cni is deployed is restarted, linkerd-cni sometimes fails to come up correctly. As a result, the other linkerd pods fail to come up, because linkerd-network-validator fails. After the cni pod is restarted, all the pods come up.

I looked at the config in /etc/cni/net.d/ and I can confirm that when the linkerd pods fail to start, the linkerd config is missing. After restarting the cni pod, the config is there.
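
The manual inspection described above can be sketched as a small script. This is only an illustration: it assumes the linkerd entry appears in the `plugins` array of the conflist with `"type": "linkerd-cni"`, and the sample file contents below are hypothetical, not taken from the affected cluster.

```python
import json

# Assumption: the linkerd entry in a chained CNI conflist is identified
# by this "type" value.
LINKERD_PLUGIN_TYPE = "linkerd-cni"


def has_linkerd_plugin(conflist_text: str) -> bool:
    """Return True if the conflist JSON contains a linkerd plugin entry."""
    try:
        config = json.loads(conflist_text)
    except json.JSONDecodeError:
        # A corrupt or half-written file is treated as "not configured".
        return False
    if not isinstance(config, dict):
        return False
    plugins = config.get("plugins")
    if not isinstance(plugins, list):
        return False
    return any(
        isinstance(plugin, dict) and plugin.get("type") == LINKERD_PLUGIN_TYPE
        for plugin in plugins
    )


# Hypothetical conflist contents, for illustration only.
WITH_LINKERD = json.dumps(
    {
        "name": "k8s-pod-network",
        "cniVersion": "0.3.1",
        "plugins": [{"type": "calico"}, {"type": "linkerd-cni"}],
    }
)
WITHOUT_LINKERD = json.dumps(
    {
        "name": "k8s-pod-network",
        "cniVersion": "0.3.1",
        "plugins": [{"type": "calico"}],
    }
)

if __name__ == "__main__":
    print(has_linkerd_plugin(WITH_LINKERD))
    print(has_linkerd_plugin(WITHOUT_LINKERD))
```

On a node, the text passed to `has_linkerd_plugin` would be the contents of the conflist file found under /etc/cni/net.d/.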

I already found this bug report, which states that this is fixed (https://github.com/linkerd/linkerd2/pull/11699). I added the repair-controller, but it did not fix the issue for me. The repair-controller only restarts the failing linkerd pods, but it should also restart the linkerd-cni pod.

How can it be reproduced?

  1. Restart a Node
  2. Sometimes the linkerd pods will fail to start

Logs, error output, etc

ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Connection refused (os error 111)

output of linkerd check -o short

linkerd check -o short

linkerd-identity

‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-04-24T18:29:51Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-control-plane-proxy

| container "linkerd-proxy" in pod "linkerd-identity-7b78b4db96-rdzhv" is not ready

Environment

  - Kubernetes version: 1.27.11
  - Cluster environment: Self hosted / Rancher / 3 Nodes
  - Host OS: Debian 12
  - Linkerd version: 2.14.10

Possible solution

I think the repair-controller restarts the linkerd pods, but it isn't restarting the linkerd-cni pod. There should be a check that the linkerd config is correctly in place.
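
A minimal sketch of what such a check might look like is shown below. It is not how the repair-controller or linkerd-cni actually works. It assumes that the active CNI config is the lexicographically first `.conflist` file in the directory and that the linkerd entry has `"type": "linkerd-cni"`; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

# Assumption: the linkerd entry is identified by this "type" value.
LINKERD_PLUGIN_TYPE = "linkerd-cni"


def active_conflist(cni_dir: Path) -> Optional[Path]:
    """Return the lexicographically first .conflist file, or None.

    Assumption: the container runtime picks the first config by file name.
    """
    candidates = sorted(cni_dir.glob("*.conflist"))
    return candidates[0] if candidates else None


def linkerd_config_missing(cni_dir: Path) -> bool:
    """Return True if the active conflist has no linkerd plugin entry."""
    path = active_conflist(cni_dir)
    if path is None:
        return True
    try:
        config = json.loads(path.read_text())
    except (OSError, ValueError):
        # Unreadable or invalid JSON: treat the linkerd config as missing.
        return True
    plugins = config.get("plugins") if isinstance(config, dict) else None
    if not isinstance(plugins, list):
        return True
    return not any(
        isinstance(plugin, dict) and plugin.get("type") == LINKERD_PLUGIN_TYPE
        for plugin in plugins
    )


if __name__ == "__main__":
    # Directory named in this report; run on the node itself.
    if linkerd_config_missing(Path("/etc/cni/net.d")):
        print("linkerd config missing: the linkerd-cni pod may need a restart")
    else:
        print("linkerd config present")
```

A controller could run a check like this after a node restart and restart the linkerd-cni pod only when it reports the config as missing, instead of restarting only the failing linkerd pods.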

Additional context

No response

Would you like to work on fixing this bug?

None

msr-financial-com, Apr 23 '24 09:04

Can you clarify how linkerd-cni is failing to start? (e.g. logs, events, status)

alpeb, Apr 25 '24 10:04

The linkerd-cni pod is not failing to start. When you restart the node and the pods come back up, the cni pod starts fine and reports that everything is fine. But if you look inside /etc/cni/net.d/10-canal.conflist, you will see that the linkerd config is sometimes missing. If you restart the cni pod, the config is there again. Pods that use linkerd are unable to start and fail with this error: ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Connection refused (os error 111)

I sometimes have to restart the cni pod and everything is ok again. The cni pod is not logging any errors.

msr-financial-com, Apr 30 '24 12:04

OK, thanks for the clarification. We've released a new version of linkerd-cni that might be better at catching changes to the network CNI config. Can you give it a try? https://github.com/linkerd/linkerd2-proxy-init/releases/tag/cni-plugin%2Fv1.5.0

alpeb, May 02 '24 15:05

Updating the CNI to this version fixed the issue. Thank you.

msr-financial-com, Jun 04 '24 07:06