The k8s node loses network connectivity (drops all packets) after the calico-node pod is manually restarted in eBPF mode
Hi guys,
I have been testing Calico's eBPF mode recently. When I manually killed the calico-node pod as a reliability test, the node lost its network and could not respond to any external connections.
I had to log in to the node through an external console (such as VMRC or iBMC) and run tc -s qdisc show dev against the node's network interface, which showed the number of dropped packets increasing. The only way to restore the node's network is to reboot the node.
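To track the growing drop counter without reading the raw tc output by eye, the statistics can be parsed and compared between two samples. This is a minimal Python sketch, assuming the usual tc -s qdisc show layout in which each "qdisc ..." header line is followed by a "Sent ... (dropped N, ...)" line; the helper names (parse_tc_drops, read_drops, drop_growth) are hypothetical and not part of Calico or iproute2.

```python
import re
import subprocess
from typing import Dict, Optional

# Statistics line printed under each qdisc by `tc -s qdisc show`, e.g.
#   " Sent 2000 bytes 20 pkt (dropped 345, overlimits 0 requeues 0)"
_STATS_RE = re.compile(r"Sent (\d+) bytes (\d+) pkt \(dropped (\d+),")


def parse_tc_drops(output: str) -> Dict[str, int]:
    """Map each qdisc (kind, handle, position) to its 'dropped' counter."""
    drops: Dict[str, int] = {}
    current: Optional[str] = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("qdisc "):
            # Header line, e.g. "qdisc clsact ffff: parent ffff:fff1"
            parts = line.split()
            kind, handle = parts[1], parts[2]
            if "parent" in parts:
                where = "parent " + parts[parts.index("parent") + 1]
            elif "root" in parts:
                where = "root"
            else:
                where = ""
            current = f"{kind} {handle} {where}".strip()
        elif current is not None:
            match = _STATS_RE.search(line)
            if match:
                drops[current] = int(match.group(3))
                current = None
    return drops


def read_drops(dev: str) -> Dict[str, int]:
    """Run `tc -s qdisc show dev <dev>` on this host and parse the counters."""
    result = subprocess.run(
        ["tc", "-s", "qdisc", "show", "dev", dev],
        capture_output=True, text=True, check=True,
    )
    return parse_tc_drops(result.stdout)


def drop_growth(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Increase in dropped packets per qdisc between two snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}
```

For example, from the node's console, call read_drops("ens160") twice a few seconds apart and pass both results to drop_growth; a steadily positive value on the clsact entry corresponds to the symptom described above.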
I have searched the Calico documentation, Google, and GitHub for a long time without finding an answer. This problem is very tricky; if it happened in a production environment, it would be a disaster!
Any advice is welcome! Orz
Expected Behavior
The k8s node keeps working normally after the calico-node pod is restarted.
Current Behavior
The k8s node loses all network connectivity after the calico-node pod is restarted; packets are dropped until the node is rebooted.
Possible Solution
None known.
Steps to Reproduce (for bugs)
1. Make sure Calico has eBPF mode enabled. Here is the FelixConfiguration
2. Make sure all calico-node DaemonSet pods are running normally and that pods can see the client's real source IP.
3. Kill the calico-node pod manually. The node loses its network and, as expected, becomes NotReady after a while.
4. Check the node's network connectivity and the eBPF clsact qdisc status with ping %gateway% and tc -s qdisc show dev ens160.
No ICMP packets get through, and the drop count grows rapidly.
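Steps 1, 3 and 4 can be partly scripted. This is a hedged Python sketch, not a Calico tool: the helper names are hypothetical, the calico-system namespace default is an assumption (manifest-based installs usually run calico-node in kube-system), and it relies only on the spec.bpfEnabled field of FelixConfiguration, the k8s-app=calico-node pod label, and standard kubectl and ping flags.

```python
import json
import subprocess
from typing import List


def bpf_enabled(felixconfig_json: str) -> bool:
    """Step 1: True if the FelixConfiguration JSON has spec.bpfEnabled: true."""
    spec = json.loads(felixconfig_json).get("spec", {})
    return spec.get("bpfEnabled") is True


def delete_calico_node_cmd(node: str, namespace: str = "calico-system") -> List[str]:
    """Step 3: kubectl command that deletes only the calico-node pod on `node`.

    Assumption: operator-based installs run calico-node in calico-system;
    manifest-based installs usually use kube-system instead.
    """
    return [
        "kubectl", "-n", namespace, "delete", "pod",
        "-l", "k8s-app=calico-node",
        "--field-selector", f"spec.nodeName={node}",
    ]


def gateway_reachable(gateway: str, count: int = 3) -> bool:
    """Step 4: ping the gateway from the node itself; False if nothing answers."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", "1", gateway],
        capture_output=True, text=True,
    )
    # iputils ping exits 0 only if at least one reply was received.
    return result.returncode == 0
```

bpf_enabled() expects the output of kubectl get felixconfiguration default -o json. The delete command (for a hypothetical node name, e.g. subprocess.run(delete_calico_node_cmd("worker-1"), check=True)) runs from a control-plane host, while gateway_reachable() has to run from the node's own console, since the node is unreachable over the network once the failure occurs. Filtering on spec.nodeName restarts only the pod under test instead of every calico-node pod in the cluster.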
Context
Only a lab environment so far, but the same failure in production would be a disaster!
Your Environment
- Calico version: 3.24.5
- Orchestrator version (e.g. kubernetes, mesos, rkt): v1.25.9 on-prem kubeadm Kubernetes cluster with one master and two worker nodes
- Operating System and version: Ubuntu 20.04.6
- Link to your project (optional): None