Gravity (kubelet service) won't recover from failed node
Description
What happened: During longevity testing, one of the nodes in our cluster came under memory pressure and had to be restarted. After the node came back up, it couldn't rejoin the cluster and was stuck in a NotReady/degraded state.
What you expected to happen: The node should become Ready again after the restart.
How to reproduce it (as minimally and precisely as possible):
Environment
- Gravity version: 7.0.15 (client) / 7.0.15 (server)
- OS: Ubuntu 18.04
- Platform: On-Prem
Relevant Debug Logs If Applicable
Kubelet logs:
The main log line that points at the issue is from the node's kubelet:

Mar 25 15:34:08 server15 kubelet[2208]: E0325 15:34:08.586086 2208 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "10.1.1.80": Get https://leader.telekube.local:6443/api/v1/nodes/10.1.1.80?timeout=10s: write tcp 10.1.1.80:55104->10.1.1.50:6443: use of closed network connection
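For what it's worth, "use of closed network connection" is the error Go reports when something writes to a connection that has already been closed, so this points more at kubelet holding a stale connection than at a plain network outage. A minimal check along those lines (a sketch only, assuming the ss utility is available inside the planet environment; 10.1.1.50:6443 is the apiserver endpoint taken from the log above):

    # From inside the planet environment on the affected node:
    # list the TCP connections held towards the apiserver endpoint, with the
    # owning process and socket state (a stale CLOSE-WAIT/FIN-WAIT socket owned
    # by kubelet would fit the error above).
    sudo ss -tnp dst 10.1.1.50:6443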
Gravity status output:
Cluster name:       cluster.local
Cluster status:     degraded (one or more of cluster nodes are not healthy)
Cluster image:      gravity-base-k8s, version 7.0.15-11
Gravity version:    7.0.15 (client) / 7.0.15 (server)
Join token:         f755def2ea04deb1d5186f265a3e221d
Last completed operation:
    * Join node server14 (10.1.1.75) as node
      ID:           eea526da-93e1-4435-ae38-7ffbc8473221
      Started:      Mon Mar 1 09:10 UTC (3 weeks ago)
      Completed:    Mon Mar 1 09:13 UTC (3 weeks ago)
Cluster endpoints:
    * Authentication gateway:
        - 10.1.1.50:32009
        - 10.1.1.95:32009
    * Cluster management URL:
        - https://10.1.1.50:32009
        - https://10.1.1.95:32009
Cluster nodes:
    Masters:
        * rac2-kvm22 / 10.1.1.50 / master
            Status:         healthy
            Remote access:  online
        * master02 / 10.1.1.95 / master
            Status:         healthy
            Remote access:  online
    Nodes:
        * master03 / 10.1.1.40 / node
            Status:         healthy
            Remote access:  online
        * server01 / 10.1.1.85 / node
            Status:         degraded
                [×] versions (checker does not comply with specified context, potential goroutine leak)
                [×] docker (checker does not comply with specified context, potential goroutine leak)
                [×] system-pods-checker (checker does not comply with specified context, potential goroutine leak)
                [×] kubelet (checker does not comply with specified context, potential goroutine leak)
                [×] node-status (checker does not comply with specified context, potential goroutine leak)
                [×] systemd (checker does not comply with specified context, potential goroutine leak)
                [×] nethealth-checker (checker does not comply with specified context, potential goroutine leak)
            Remote access:  online
        * user / 10.1.1.35 / node
            Status:         offline
            Remote access:  online
        * server05 / 10.1.1.80 / node
            Status:         degraded
                [×] MemoryPressure/NodeStatusUnknown (Kubelet stopped posting node status.)
                [×] DiskPressure/NodeStatusUnknown (Kubelet stopped posting node status.)
                [×] PIDPressure/NodeStatusUnknown (Kubelet stopped posting node status.)
                [×] Ready/NodeStatusUnknown (Kubelet stopped posting node status.)
            Remote access:  online
        * server5 / 10.1.1.20 / node
            Status:         healthy
            Remote access:  online
        * server14 / 10.1.1.75 / node
            Status:         healthy
            Remote access:  online
[ERROR]: degraded
Restarting kubelet fixed the issue. Attaching the kubelet logs; I've also got the gravityreport.tar.gz (800MB) - please let me know if it's needed.
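In case it helps anyone hitting the same symptom, this is roughly what the workaround amounts to. A minimal sketch, assuming kubelet runs as the kube-kubelet systemd unit inside the planet environment (the unit name is an assumption; confirm it with systemctl list-units first):

    # Enter the planet environment on the affected node
    sudo gravity shell

    # Confirm the kubelet unit name, then restart it
    systemctl list-units 'kube-*' --no-pager
    systemctl restart kube-kubelet

    # Watch it re-register the node with the apiserver
    journalctl -u kube-kubelet -f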
Important to mention - a telnet from the gravity shell on the node where kubelet failed to the other master (10.1.1.80:55104->10.1.1.50:6443) worked properly while this error was being logged. Some server names and IPs have been whitelabeled for security reasons.
master: 10.1.1.50, node: 10.1.1.80. Attached: kubelet.log
I'd check whether an apiserver is actually answering on 6443 from that node. Also check whether there isn't some unexpected latency.
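A sketch of how both of those could be checked from the affected node; /healthz is just used as a cheap endpoint (even a 401/403 response proves something is answering on 6443), and curl's timing write-out variables split out DNS, connect, TLS and total time to spot latency:

    # Is anything answering on 6443 from this node, and how long does it take?
    # -k skips certificate verification; we only care about reachability/timing.
    for i in 1 2 3; do
        curl -sk -o /dev/null --max-time 10 \
             -w 'http=%{http_code} dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
             https://leader.telekube.local:6443/healthz
    done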