katalyst-core
Enhanced K8s worker node network error and weird automatic restart
What happened?
I followed this documentation to install the dev k8s environment.
My master node came up quickly and is running fine, but the worker nodes are not Ready because of cni plugin not initialized
Extracted from kubectl describe node debian-node-2
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Mon, 10 Jun 2024 15:46:00 +0800 Mon, 10 Jun 2024 15:46:00 +0800 FlannelIsUp Flannel is running on this node
MemoryPressure False Mon, 10 Jun 2024 15:45:56 +0800 Mon, 10 Jun 2024 15:45:50 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 10 Jun 2024 15:45:56 +0800 Mon, 10 Jun 2024 15:45:50 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 10 Jun 2024 15:45:56 +0800 Mon, 10 Jun 2024 15:45:50 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Mon, 10 Jun 2024 15:45:56 +0800 Mon, 10 Jun 2024 15:45:50 +0800 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Logs extracted from kubectl logs -n kube-system canal-ftzlb (for debian-node-2):
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=Get "https://172.23.192.1:443/api/v1/namespaces?limit=500": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https://172.23.192.1:443/api/v1/nodes?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" error=Get "https://172.23.192.1:443/api/v1/pods?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=Get "https://172.23.192.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=Get "https://172.23.192.1:443/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.325 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/hostendpoints?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.560 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/bgpconfigurations"
2024-06-10 08:58:51.587 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/globalnetworksets?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.731 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networkpolicies" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.900 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2024-06-10 08:58:51.926 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2024-06-10 08:58:51.947 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/globalnetworkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.160 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations"
2024-06-10 08:58:52.201 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networksets" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/networksets?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.323 [INFO][63] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/profiles"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesservice"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2024-06-10 08:58:52.326 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints"
2024-06-10 08:58:52.454 [INFO][63] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/caliconodestatuses?limit=500&resourceVersion=23696&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=Get "https://172.23.192.1:443/api/v1/namespaces?limit=500": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https://172.23.192.1:443/api/v1/nodes?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesservice" error=Get "https://172.23.192.1:443/api/v1/services?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" error=Get "https://172.23.192.1:443/api/v1/pods?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=Get "https://172.23.192.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=Get "https://172.23.192.1:443/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/bgpconfigurations" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/bgpconfigurations?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
From the logs, it seems that connections to 172.23.192.1:443 (the default/kubernetes Service ClusterIP) are being refused.
But when I check iptables on the node, the Service rules are there:
root@debian-node-2:~/deploy# sudo iptables-save | grep 172.23.192.1
-A KUBE-SERVICES -d 172.23.192.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 172.23.192.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SVC-ERIFXISQEP7F7OF4 ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-JD5MR3NA4I4DYORP ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 172.28.208.0/20 -d 172.23.192.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-TCOU7JCQXEZGVUNU ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
I also tried curl and telnet to connect to 172.23.192.1:443 from debian-node-2, and both work.
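A minimal sketch of the kind of connectivity check I mean (the endpoint and flags here are illustrative, not the exact commands I ran):
# probe the kubernetes Service VIP (172.23.192.1 is the default/kubernetes ClusterIP)
curl -k https://172.23.192.1:443/version   # -k skips TLS verification; any HTTP response proves the VIP is reachable
nc -vz 172.23.192.1 443                    # plain TCP handshake test (requires netcat)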
So the connection itself seems fine; I have been debugging this for days but still have no luck.
I also tried reinstalling several times, but the problem consistently reappeared.
Finally, I worked around it by running systemctl restart containerd.service on debian-node-2 (the worker node).
Although this fixed the problem, I still have a big question: why does simply restarting containerd fix it?
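My rough understanding is that the kubelet's NetworkReady condition is just whatever the CRI runtime reports, and containerd reports NetworkReady=false ("cni plugin not initialized") until it has loaded a CNI config from /etc/cni/net.d. A sketch of how containerd's own view can be checked (assuming crictl is installed and pointed at containerd):
# does containerd itself report the network (CNI) as ready?
crictl info | grep -B1 -A3 NetworkReady
# is there a CNI config on disk for it to load?
ls -l /etc/cni/net.d/
If NetworkReady only becomes true after restarting containerd even though the files under /etc/cni/net.d were already there, that would suggest containerd did not pick up the CNI config that Canal wrote after containerd had started.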
At the same time, my master node (i.e., debian-node-1) would randomly reboot, which was very confusing. It usually manifested as something like client_loop: send disconnect: Broken pipe, which I initially thought was an ssh connection problem, but when I left the server alone overnight and connected again the next day, I found it had rebooted again.
There were no memory issues, and memory usage was not even high. journalctl -xb -p err did not indicate any problems.
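In case it helps others, a sketch of the generic places reboot evidence can be looked for on Debian (nothing kubewharf-specific; journalctl -b -1 needs persistent journald storage):
last -x reboot shutdown | head     # reboot/shutdown history from wtmp
journalctl --list-boots            # boots recorded in the journal
journalctl -b -1 -n 50 --no-pager  # tail of the previous boot's log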
I previously installed vanilla Kubernetes on debian-node-1, and also a Kind-based k8s cluster, and there was no restart problem then. I did a full system reset before installing kubewharf-enhanced-k8s. debian-node-2 (the worker node) has the same problem.
Below are the neofetch summaries for debian-node-1 and debian-node-2. The two machines are connected to the same router, which has a bypass route (192.168.2.201) set up to handle the proxy.
root@debian-node-1:~# neofetch
_,met$$$$$gg. root@debian-node-1
,g$$$$$$$$$$$$$$$P. ------------------
,g$$P" """Y$$.". OS: Debian GNU/Linux 12 (bookworm) x86_64
,$$P' `$$$. Host: UM480XT
',$$P ,ggs. `$$b: Kernel: 6.1.0-21-amd64
`d$$' ,$P"' . $$$ Uptime: 1 hour
$$P d$' , $$P Packages: 573 (dpkg)
$$: $$. - ,d$$' Shell: bash 5.2.15
$$; Y$b._ _,d$P' CPU: AMD Ryzen 7 4800H with Radeon Graphics (16) @ 2.900GHz
Y$$. `.`"Y$$$$P"' GPU: AMD ATI 04:00.0 Renoir
`$$b "-.__ Memory: 1494MiB / 31529MiB
`Y$$
`Y$$.
`$$b.
`Y$$b.
`"Y$b._
`"""
root@debian-node-2:~/deploy# neofetch
_,met$$$$$gg. root@debian-node-2
,g$$$$$$$$$$$$$$$P. ------------------
,g$$P" """Y$$.". OS: Debian GNU/Linux 12 (bookworm) x86_64
,$$P' `$$$. Host: UM480XT
',$$P ,ggs. `$$b: Kernel: 6.1.0-21-amd64
`d$$' ,$P"' . $$$ Uptime: 2 hours, 49 mins
$$P d$' , $$P Packages: 517 (dpkg)
$$: $$. - ,d$$' Shell: bash 5.2.15
$$; Y$b._ _,d$P' CPU: AMD Ryzen 7 4800H with Radeon Graphics (16) @ 2.900GHz
Y$$. `.`"Y$$$$P"' GPU: AMD ATI 04:00.0 Renoir
`$$b "-.__ Memory: 703MiB / 15425MiB
`Y$$
`Y$$.
`$$b.
`Y$$b.
`"Y$b._
`"""
What did you expect to happen?
Worker nodes become Ready and run normally without needing systemctl restart containerd.service
The master node does not restart on its own
How can we reproduce it (as minimally and precisely as possible)?
Follow this documentation.
Software version
debian-node-1:
root@debian-node-1:~# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
debian-node-2:
root@debian-node-2:~/deploy# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}