Talos Node Freezes for extended period (Kernel still responsive)
Bug Report
Talos nodes randomly freeze for extended periods (kernel still up and running)
Description
One of the nodes in my Talos 1.10.4 cluster experiences random freezes, typically occurring around 4:30 AM and lasting 10-12 minutes. During these freeze periods:
- The node becomes completely unresponsive to network requests
- No syslogs are generated during the freeze period
- Talos components stop producing logs
- The Talos console (visible via BMC) appears frozen with time not advancing
- The underlying system remains responsive in the node console (switching to a TTY with CTRL+ALT+F1 works)
- The kernel appears to still be running
After approximately 12 minutes, the node automatically recovers and all processes resume normal operation, generating timeout and reconnection messages. However, there are no error logs or other indicators explaining what caused the freeze.
- Frequency: random occurrences, often around 4:30 AM
- Duration: 10-12 minutes consistently
- Impact: the node becomes unavailable for scheduling and existing workloads may be affected
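To pin down the exact outage window from outside the node, a small external probe like the sketch below can timestamp when the node stops answering on the Talos API port. This is only a data-gathering suggestion; the node address is a placeholder, and port 50000 (apid) is simply a port the node normally answers on.

```python
#!/usr/bin/env python3
# External availability probe: logs a timestamp whenever the node starts or
# stops answering TCP on the Talos apid port. Run it from another machine.
import socket
import time
from datetime import datetime

NODE = "10.2.0.50"   # placeholder: address of the affected node
PORT = 50000         # Talos apid; any port the node normally answers on works
INTERVAL = 5         # seconds between probes
TIMEOUT = 3          # seconds before a probe counts as failed

last_state = None
while True:
    try:
        with socket.create_connection((NODE, PORT), timeout=TIMEOUT):
            state = "up"
    except OSError:
        state = "down"
    if state != last_state:
        print(f"{datetime.now().isoformat()} node is {state}", flush=True)
        last_state = state
    time.sleep(INTERVAL)
```

Correlating those transitions with the dmesg excerpts below should confirm whether the node drops off the network for the same 10-12 minute window.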
Logs
dmesg before
05/07/2025 04:29:46 [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
05/07/2025 04:29:50 [talos] service[kubelet](Running): Health check successful
05/07/2025 04:29:56 [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
05/07/2025 04:30:00 [talos] service[kubelet](Running): Health check successful
05/07/2025 04:30:29 [talos] service[machined](Running): Health check failed: dial unix /system/run/machined/machine.sock: i/o timeout
05/07/2025 04:30:38 [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
dmesg after
05/07/2025 04:42:16 [talos] service[apid](Running): Health check failed: dial tcp 127.0.0.1:50000: i/o timeout
05/07/2025 04:42:40 [talos] service[containerd](Running): Health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
05/07/2025 04:43:14 [talos] service[trustd](Running): Health check failed: dial tcp 127.0.0.1:50001: i/o timeout
05/07/2025 04:45:30 [talos] service[cri](Running): Health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
05/07/2025 04:46:18 [talos] service[udevd](Running): Health check failed: context deadline exceeded:
05/07/2025 04:46:19 [talos] service[udevd](Running): Health check successful
05/07/2025 04:46:19 [talos] service[apid](Running): Health check successful
05/07/2025 04:46:19 [talos] service[etcd](Running): Health check failed: context deadline exceeded
05/07/2025 04:46:19 [talos] service[trustd](Running): Health check successful
05/07/2025 04:46:19 [talos] service[machined](Running): Health check successful
05/07/2025 04:46:19 [talos] configuring siderolink connection {"component": "controller-runtime", "controller": "siderolink.ManagerController", "peer_endpoint": "10.2.0.128:50180", "next_peer_endpoint": ""}
05/07/2025 04:46:19 [talos] siderolink connection configured {"component": "controller-runtime", "controller": "siderolink.ManagerController", "endpoint": "https://omni.home.sekops.ch:8090/?jointoken=vnuvIIi2WbonukMzQ5bWct7lNoAfqvkNZg8MO2mvIiE", "node_uuid": "00000000-0000-0000-0000-7cc2554e4930", "node_address": "fdae:41e4:649b:9303:a85c:9d8f:37eb:ba2f/64"}
05/07/2025 04:46:19 [talos] service[containerd](Running): Health check successful
05/07/2025 04:46:20 [talos] service[cri](Running): Health check successful
05/07/2025 04:46:20 [talos] service[kubelet](Running): Health check successful
Environment
- Talos version: 1.10.4 (kernel 6.12.31-talos)
- Kubernetes version: 1.33.2
- Platform: bare-metal
This looks like resource exhaustion, e.g. all CPU being used by some workload. I would recommend installing monitoring.
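If the existing monitoring loses scrapes exactly during the freeze, one low-effort option is to log the kernel's pressure stall information (PSI) locally on the affected node, e.g. from a privileged pod with the host's /proc visible (the deployment details are assumptions here); a rough sketch:

```python
#!/usr/bin/env python3
# Log PSI (pressure stall information) once a minute so a CPU, IO, or memory
# stall around 04:30 still shows up even if remote scrapes are lost.
# Assumes the host's /proc/pressure/* files are visible to this process.
import time
from datetime import datetime

RESOURCES = ("cpu", "io", "memory")

while True:
    stamp = datetime.now().isoformat()
    for res in RESOURCES:
        try:
            with open(f"/proc/pressure/{res}") as f:
                # Each file holds "some" (and usually "full") stall averages.
                print(stamp, res, f.read().strip().replace("\n", " | "), flush=True)
        except OSError as exc:
            print(stamp, res, f"unavailable: {exc}", flush=True)
    time.sleep(60)
```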
I do have monitoring and I was checking that. Will check more carefully next time it happens. I don't have any CPU-intensive tasks (that I am aware of) and will report back the next time this happens.
I think we're stumbling upon the same issue here. No workloads are running in the cluster at all.