talos
Kubernetes and Talos services crashing under memory pressure
Bug Report
Description
One of our worker nodes crashes occasionally: both kubelet and apid go down. Because apid crashes too, we have not yet been able to collect any logs.
The problem is solved by restarting the node.
Logs
We have not been able to retrieve any yet, but the node goes into DiskPressure and MemoryPressure.
We are in the process of implementing some form of log collection and will provide logs asap.
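Until proper log collection is in place, the node conditions can at least be recorded from outside the node, since they come from the Kubernetes API rather than from apid. The following is only a minimal sketch, assuming `kubectl` is on the PATH and configured for the cluster; the function names and the 30-second polling interval are our own illustrative choices, not part of Talos or Kubernetes.

```python
import json
import subprocess
import time

# Conditions that signal a problem when their status is "True".
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def unhealthy_conditions(node: dict) -> list:
    """Return the problem conditions found in a node object (kubectl JSON)."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype in PRESSURE_TYPES and status == "True":
            problems.append(ctype)
        elif ctype == "Ready" and status != "True":
            # Ready is "False" or "Unknown" once kubelet stops reporting.
            problems.append("NotReady")
    return problems


def fetch_node(name: str) -> dict:
    """Read the node object through the Kubernetes API using kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def watch(name: str, interval: float = 30.0) -> None:
    """Print a timestamped line whenever the set of problems changes."""
    last = None
    while True:
        try:
            current = unhealthy_conditions(fetch_node(name))
        except subprocess.CalledProcessError as err:
            current = ["kubectl failed: " + err.stderr.strip()]
        if current != last:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S%z")
            print(stamp, name, current or "healthy", flush=True)
            last = current
        time.sleep(interval)
```

Running `watch("<node-name>")` on a machine outside the affected node would give a rough timeline of when the pressure conditions appear relative to the crash.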
Environment
- Talos version: v1.5.4
- Kubernetes version: v1.28.3
- Platform: QEMU KVM / Proxmox
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 06 Dec 2023 02:47:10 +0100   Wed, 06 Dec 2023 02:47:10 +0100   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Tue, 09 Jan 2024 14:38:57 +0100   Tue, 09 Jan 2024 14:33:51 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 09 Jan 2024 14:38:57 +0100   Tue, 09 Jan 2024 14:33:51 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 09 Jan 2024 14:38:57 +0100   Tue, 09 Jan 2024 14:33:51 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 09 Jan 2024 14:38:57 +0100   Tue, 09 Jan 2024 14:33:51 +0100   KubeletReady                 kubelet is posting ready status
Node conditions on crash.
Talos services have a cgroup reservation, so it would be good to see the logs from around the time of the crash, as the cause might be something else.
btw the conditions look good