Talos Node Freezes for extended period (Kernel still responsive)
Bug Report
Talos nodes randomly freeze for extended periods (kernel still up and running)
Description
One of the nodes in my Talos 1.10.4 cluster experiences random freezes, typically occurring around 4:30 AM and lasting 10-12 minutes. During these freeze periods:
- The node becomes completely unresponsive to network requests
- No syslogs are generated during the freeze period
- Talos components stop producing logs
- The Talos console (visible via BMC) appears frozen with time not advancing
- The underlying system remains responsive in the node console (switching to a TTY with CTRL+ALT+F1 works)
- The kernel appears to still be running
After approximately 12 minutes, the node automatically recovers and all processes resume normal operation, generating timeout and reconnection messages. However, there are no error logs or other indicators explaining what caused the freeze.
- Frequency: random occurrences, often around 4:30 AM
- Duration: 10-12 minutes consistently
- Impact: the node becomes unavailable for scheduling and existing workloads may be affected
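To pin down the exact outage window from outside the node, a small external probe like the sketch below can timestamp when the node stops answering on the Talos API port. This is only a data-gathering suggestion; the node address is a placeholder, and port 50000 (apid) is simply a port the node normally answers on.

```python
#!/usr/bin/env python3
# External availability probe: logs a timestamp whenever the node starts or
# stops answering TCP on the Talos apid port. Run it from another machine.
import socket
import time
from datetime import datetime

NODE = "10.2.0.50"   # placeholder: address of the affected node
PORT = 50000         # Talos apid; any port the node normally answers on works
INTERVAL = 5         # seconds between probes
TIMEOUT = 3          # seconds before a probe counts as failed

last_state = None
while True:
    try:
        with socket.create_connection((NODE, PORT), timeout=TIMEOUT):
            state = "up"
    except OSError:
        state = "down"
    if state != last_state:
        print(f"{datetime.now().isoformat()} node is {state}", flush=True)
        last_state = state
    time.sleep(INTERVAL)
```

Correlating those transitions with the dmesg excerpts below should confirm whether the node drops off the network for the same 10-12 minute window.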
Logs
dmesg before
05/07/2025 04:29:46 [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
05/07/2025 04:29:50 [talos] service[kubelet](Running): Health check successful
05/07/2025 04:29:56 [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
05/07/2025 04:30:00 [talos] service[kubelet](Running): Health check successful
05/07/2025 04:30:29 [talos] service[machined](Running): Health check failed: dial unix /system/run/machined/machine.sock: i/o timeout
05/07/2025 04:30:38 [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
dmesg after
05/07/2025 04:42:16 [talos] service[apid](Running): Health check failed: dial tcp 127.0.0.1:50000: i/o timeout
05/07/2025 04:42:40 [talos] service[containerd](Running): Health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
05/07/2025 04:43:14 [talos] service[trustd](Running): Health check failed: dial tcp 127.0.0.1:50001: i/o timeout
05/07/2025 04:45:30 [talos] service[cri](Running): Health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
05/07/2025 04:46:18 [talos] service[udevd](Running): Health check failed: context deadline exceeded:
05/07/2025 04:46:19 [talos] service[udevd](Running): Health check successful
05/07/2025 04:46:19 [talos] service[apid](Running): Health check successful
05/07/2025 04:46:19 [talos] service[etcd](Running): Health check failed: context deadline exceeded
05/07/2025 04:46:19 [talos] service[trustd](Running): Health check successful
05/07/2025 04:46:19 [talos] service[machined](Running): Health check successful
05/07/2025 04:46:19 [talos] configuring siderolink connection {"component": "controller-runtime", "controller": "siderolink.ManagerController", "peer_endpoint": "10.2.0.128:50180", "next_peer_endpoint": ""}
05/07/2025 04:46:19 [talos] siderolink connection configured {"component": "controller-runtime", "controller": "siderolink.ManagerController", "endpoint": "https://omni.home.sekops.ch:8090/?jointoken=vnuvIIi2WbonukMzQ5bWct7lNoAfqvkNZg8MO2mvIiE", "node_uuid": "00000000-0000-0000-0000-7cc2554e4930", "node_address": "fdae:41e4:649b:9303:a85c:9d8f:37eb:ba2f/64"}
05/07/2025 04:46:19 [talos] service[containerd](Running): Health check successful
05/07/2025 04:46:20 [talos] service[cri](Running): Health check successful
05/07/2025 04:46:20 [talos] service[kubelet](Running): Health check successful
Environment
- Talos version: 1.10.4 (kernel 6.12.31-talos)
- Kubernetes version: 1.33.2
- Platform: bare-metal
This looks like resource exhaustion, e.g. all CPU being used by some workload. I would recommend installing monitoring.
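If the existing monitoring loses scrapes exactly during the freeze, one low-effort option is to log the kernel's pressure stall information (PSI) locally on the affected node, e.g. from a privileged pod with the host's /proc visible (the deployment details are assumptions here); a rough sketch:

```python
#!/usr/bin/env python3
# Log PSI (pressure stall information) once a minute so a CPU, IO, or memory
# stall around 04:30 still shows up even if remote scrapes are lost.
# Assumes the host's /proc/pressure/* files are visible to this process.
import time
from datetime import datetime

RESOURCES = ("cpu", "io", "memory")

while True:
    stamp = datetime.now().isoformat()
    for res in RESOURCES:
        try:
            with open(f"/proc/pressure/{res}") as f:
                # Each file holds "some" (and usually "full") stall averages.
                print(stamp, res, f.read().strip().replace("\n", " | "), flush=True)
        except OSError as exc:
            print(stamp, res, f"unavailable: {exc}", flush=True)
    time.sleep(60)
```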
I do have monitoring and I was checking that. Will check more carefully next time it happens. I don't have any CPU-intensive tasks (that I am aware of) and will report back the next time this happens.
I think we're stumbling upon the same issue here. No workloads are running in the cluster at all.