
Gravity (kubelet service) won't recover from failed node

Open · snirkatriel opened this issue 3 years ago • 1 comment

Description

What happened: During longevity testing, one of the nodes in our cluster experienced memory pressure and had to be restarted. After the node came back up, it could not connect to the cluster and was stuck in a NotReady/degraded state.

What you expected to happen: The node should become Ready after the restart.

How to reproduce it (as minimally and precisely as possible):

Environment

  • Gravity version: 7.0.15 (client) / 7.0.15 (server)
  • OS: Ubuntu 18.04
  • Platform: On-Prem

Relevant Debug Logs If Applicable

Kubelet logs:

The main log line pointing to the issue is from the node's kubelet:

```
Mar 25 15:34:08 server15 kubelet[2208]: E0325 15:34:08.586086 2208 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "10.1.1.80": Get https://leader.telekube.local:6443/api/v1/nodes/10.1.1.80?timeout=10s: write tcp 10.1.1.80:55104->10.1.1.50:6443: use of closed network connection
```

Gravity status output:

```
Cluster name:       cluster.local
Cluster status:     degraded (one or more of cluster nodes are not healthy)
Cluster image:      gravity-base-k8s, version 7.0.15-11
Gravity version:    7.0.15 (client) / 7.0.15 (server)
Join token:         f755def2ea04deb1d5186f265a3e221d
Last completed operation:
    * Join node server14 (10.1.1.75) as node
      ID:           eea526da-93e1-4435-ae38-7ffbc8473221
      Started:      Mon Mar 1 09:10 UTC (3 weeks ago)
      Completed:    Mon Mar 1 09:13 UTC (3 weeks ago)
Cluster endpoints:
    * Authentication gateway:
        - 10.1.1.50:32009
        - 10.1.1.95:32009
    * Cluster management URL:
        - https://10.1.1.50:32009
        - https://10.1.1.95:32009
Cluster nodes:
    Masters:
        * rac2-kvm22 / 10.1.1.50 / master
            Status:         healthy
            Remote access:  online
        * master02 / 10.1.1.95 / master
            Status:         healthy
            Remote access:  online
    Nodes:
        * master03 / 10.1.1.40 / node
            Status:         healthy
            Remote access:  online
        * server01 / 10.1.1.85 / node
            Status:         degraded
                [×] versions (checker does not comply with specified context, potential goroutine leak)
                [×] docker (checker does not comply with specified context, potential goroutine leak)
                [×] system-pods-checker (checker does not comply with specified context, potential goroutine leak)
                [×] kubelet (checker does not comply with specified context, potential goroutine leak)
                [×] node-status (checker does not comply with specified context, potential goroutine leak)
                [×] systemd (checker does not comply with specified context, potential goroutine leak)
                [×] nethealth-checker (checker does not comply with specified context, potential goroutine leak)
            Remote access:  online
        * user / 10.1.1.35 / node
            Status:         offline
            Remote access:  online
        * server05 / 10.1.1.80 / node
            Status:         degraded
                [×] MemoryPressure/NodeStatusUnknown (Kubelet stopped posting node status.)
                [×] DiskPressure/NodeStatusUnknown (Kubelet stopped posting node status.)
                [×] PIDPressure/NodeStatusUnknown (Kubelet stopped posting node status.)
                [×] Ready/NodeStatusUnknown (Kubelet stopped posting node status.)
            Remote access:  online
        * server5 / 10.1.1.20 / node
            Status:         healthy
            Remote access:  online
        * server14 / 10.1.1.75 / node
            Status:         healthy
            Remote access:  online
[ERROR]: degraded
```

Restarting kubelet fixed the issue. Attaching the kubelet logs; I've also got the gravityreport.tar.gz (800MB) - please let me know if it's needed.

Important to mention: telnet from the gravity shell on the node where kubelet failed to the other master (10.1.1.80 -> 10.1.1.50:6443) worked properly while this error was occurring. Some server names and IPs have been white-labeled for security reasons.

master: 10.1.1.50, node: 10.1.1.80 (kubelet.log attached)
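Not part of the original report, but the telnet check mentioned above can be scripted so that it also exercises the TLS layer. A minimal Go sketch, assuming the master endpoint from the kubelet error (10.1.1.50:6443); certificate validation is deliberately skipped because this only probes reachability:

```go
// Minimal reachability probe, roughly equivalent to the telnet test,
// followed by a TLS handshake on the same connection.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "10.1.1.50:6443" // master apiserver endpoint from the kubelet error

	// Plain TCP connect first (what telnet verifies).
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Println("TCP connect failed:", err)
		return
	}
	fmt.Println("TCP connect OK:", conn.RemoteAddr())

	// A connection that is accepted but dropped right away would fail here,
	// which a bare telnet test would not reveal.
	tconn := tls.Client(conn, &tls.Config{InsecureSkipVerify: true}) // reachability check only
	if err := tconn.Handshake(); err != nil {
		fmt.Println("TLS handshake failed:", err)
		return
	}
	fmt.Println("TLS handshake OK")
	tconn.Close()
}
```

If the TCP connect succeeds but the handshake fails or the connection is reset shortly afterwards, that would point at something between the node and the apiserver closing connections, which the telnet test alone would not show.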

snirkatriel · Mar 31 '21 08:03

I'd check whether it's actually an apiserver answering on 6443 from that node. Also, check whether there's some unexpected latency.
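A hedged sketch of those two checks (what is answering on 6443, and what the connect/handshake latency looks like), using the leader.telekube.local:6443 endpoint from the kubelet error; the loop count and timeouts are illustrative, not from the issue:

```go
// Sketch: identify what answers on the apiserver port and sample connect/handshake latency.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "leader.telekube.local:6443" // endpoint from the kubelet error

	for i := 0; i < 5; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
		if err != nil {
			fmt.Printf("#%d dial error: %v\n", i, err)
			continue
		}
		tconn := tls.Client(conn, &tls.Config{InsecureSkipVerify: true}) // inspecting, not validating
		if err := tconn.Handshake(); err != nil {
			fmt.Printf("#%d handshake error: %v\n", i, err)
			conn.Close()
			continue
		}
		elapsed := time.Since(start)

		// The serving certificate's subject and SANs should identify the kube-apiserver.
		cert := tconn.ConnectionState().PeerCertificates[0]
		fmt.Printf("#%d %s in %v, subject=%q, SANs=%v\n",
			i, conn.RemoteAddr(), elapsed, cert.Subject.CommonName, cert.DNSNames)
		tconn.Close()
		time.Sleep(time.Second)
	}
}
```

If the serving certificate does not look like an apiserver certificate, or the handshake times are consistently high or erratic, that would support the points above.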

a-palchikov · Jun 24 '21 10:06