linkerd2
After a node restart, the linkerd-cni pod sometimes has to be restarted
What is the issue?
When a node in the cluster where linkerd-cni is deployed is restarted, it will sometimes fail to come up correctly. As a result, the other linkerd pods fail to come up because the linkerd-network-validator fails. After restarting the cni pod, all the pods come up.
I looked at the config in /etc/cni/net.d/ and I can confirm that when the linkerd pods fail to start, the linkerd config is missing. After restarting the cni pod, the config is there.
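For reference, the missing entry can be checked on the node with jq. The file below is a hypothetical stand-in for /etc/cni/net.d/10-canal.conflist so the snippet is self-contained; on the affected node you would query the real file instead:

```shell
# Hypothetical conflist after the node restart, with the linkerd entry
# missing (stand-in for /etc/cni/net.d/10-canal.conflist).
cat > /tmp/10-canal.conflist <<'EOF'
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    { "type": "calico" },
    { "type": "portmap" }
  ]
}
EOF

# Count the plugin entries of type "linkerd-cni"; 0 means the linkerd
# config is missing and meshed pods will fail the network validator.
jq '[.plugins[] | select(.type == "linkerd-cni")] | length' /tmp/10-canal.conflist
# → 0
```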
I already found this bug report, which states that this was fixed (https://github.com/linkerd/linkerd2/pull/11699). I added the repair controller, but it did not fix the issue for me. The repair controller only restarts the failing linkerd pods; it should also restart the linkerd-cni pod.
How can it be reproduced?
- Restart a Node
- Sometimes the linkerd pods will fail to start
Logs, error output, etc
ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Connection refused (os error 111)
output of linkerd check -o short
linkerd-identity
‼ issuer cert is valid for at least 60 days issuer certificate will expire on 2024-04-24T18:29:51Z see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-control-plane-proxy
| container "linkerd-proxy" in pod "linkerd-identity-7b78b4db96-rdzhv" is not ready
Environment
- Kubernetes version: 1.27.11
- Cluster environment: Self hosted / Rancher / 3 nodes
- Host OS: Debian 12
- Linkerd version: 2.14.10
Possible solution
I think the repair controller restarts the linkerd pods, but it isn't restarting the linkerd-cni pod. There should be a check in place that verifies the linkerd CNI config is actually present.
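A minimal sketch of such a check, assuming the plugin entry's type is "linkerd-cni" (the conflist path is taken from this issue, the function name and demo file are hypothetical, and a real repair controller would restart its own pod via the Kubernetes API when the check fails):

```shell
# Exit 0 when the linkerd entry is present in the conflist, non-zero otherwise.
linkerd_cni_config_present() {
  # Default to the conflist path from this issue; overridable for the demo.
  local conflist="${1:-/etc/cni/net.d/10-canal.conflist}"
  jq -e '.plugins[] | select(.type == "linkerd-cni")' "$conflist" > /dev/null
}

# Demo against a hypothetical healthy conflist:
cat > /tmp/healthy.conflist <<'EOF'
{ "plugins": [ { "type": "calico" }, { "type": "linkerd-cni" } ] }
EOF

if linkerd_cni_config_present /tmp/healthy.conflist; then
  echo "linkerd CNI config present"
else
  echo "linkerd CNI config missing - cni pod should be restarted"
fi
# → linkerd CNI config present
```

jq's `-e` flag makes the exit status reflect whether the query produced a result, which is what lets the function double as a health check.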
Additional context
No response
Would you like to work on fixing this bug?
None
Can you clarify how the linkerd-cni is failing to start? (e.g. logs, events, status)
The linkerd-cni is not failing to start.
When you restart the node and the pods come back up, the cni pod starts fine and reports that everything is fine. But when you look inside /etc/cni/net.d/10-canal.conflist, you will see that the linkerd config is sometimes missing. If you restart the cni pod, the config is there. Pods that use linkerd will not be able to start and will produce this error:
ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Connection refused (os error 111)
I sometimes have to restart the cni pod and then everything is OK again. The cni pod does not log any errors.
Ok thanks for the clarification. We've released a new version for linkerd-cni that might be able to better catch when the network CNI config changes. Can you give it a try? https://github.com/linkerd/linkerd2-proxy-init/releases/tag/cni-plugin%2Fv1.5.0
Updating the CNI to this version fixed the issue. Thank you.