
Talos Node Freezes for extended period (Kernel still responsive)

Status: Open · bernardgut opened this issue 6 months ago · 3 comments

Bug Report

Talos nodes randomly freeze for extended periods (kernel still up and running)

Description

One of the nodes in my Talos 1.10.4 cluster experiences random freezes, typically occurring around 4:30 AM, lasting 10-12 minutes. During these freeze periods:

  • The node becomes completely unresponsive to network requests
  • No syslogs are generated during the freeze period
  • Talos components stop producing logs
  • The Talos console (visible via BMC) appears frozen with time not advancing
  • Switching TTYs in the node console (CTRL+ALT+F1) still works, so the underlying system remains responsive
  • The kernel appears to still be running

After approximately 12 minutes, the node recovers on its own and all processes resume normal operation, generating timeout and reconnection messages. However, there are no error logs or other indicators explaining what caused the freeze.

Frequency: random occurrences, often around 4:30 AM
Duration: 10-12 minutes consistently
Impact: the node becomes unavailable for scheduling and existing workloads may be affected
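
To narrow down what is happening during the window, one option is to stream the kernel and machined logs from another machine so the last messages before the hang are captured. This is only a sketch; <node-ip> is a placeholder for the affected node's address:

  # follow kernel messages from the affected node (output stalls while the node is frozen)
  talosctl -n <node-ip> dmesg --follow

  # follow machined service logs in parallel
  talosctl -n <node-ip> logs machined --follow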

Logs

dmesg before

05/07/2025 04:29:46
[talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
05/07/2025 04:29:50
[talos] service[kubelet](Running): Health check successful
05/07/2025 04:29:56
[talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded
05/07/2025 04:30:00
[talos] service[kubelet](Running): Health check successful
05/07/2025 04:30:29
[talos] service[machined](Running): Health check failed: dial unix /system/run/machined/machine.sock: i/o timeout
05/07/2025 04:30:38
[talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": context deadline exceeded 

dmesg after

05/07/2025 04:42:16
[talos] service[apid](Running): Health check failed: dial tcp 127.0.0.1:50000: i/o timeout
05/07/2025 04:42:40
[talos] service[containerd](Running): Health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
05/07/2025 04:43:14
[talos] service[trustd](Running): Health check failed: dial tcp 127.0.0.1:50001: i/o timeout
05/07/2025 04:45:30
[talos] service[cri](Running): Health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
05/07/2025 04:46:18
[talos] service[udevd](Running): Health check failed: context deadline exceeded:
05/07/2025 04:46:19
[talos] service[udevd](Running): Health check successful
05/07/2025 04:46:19
[talos] service[apid](Running): Health check successful
05/07/2025 04:46:19
[talos] service[etcd](Running): Health check failed: context deadline exceeded
05/07/2025 04:46:19
[talos] service[trustd](Running): Health check successful
05/07/2025 04:46:19
[talos] service[machined](Running): Health check successful
05/07/2025 04:46:19
[talos] configuring siderolink connection {"component": "controller-runtime", "controller": "siderolink.ManagerController", "peer_endpoint": "10.2.0.128:50180", "next_peer_endpoint": ""}
05/07/2025 04:46:19
[talos] siderolink connection configured {"component": "controller-runtime", "controller": "siderolink.ManagerController", "endpoint": "https://omni.home.sekops.ch:8090/?jointoken=vnuvIIi2WbonukMzQ5bWct7lNoAfqvkNZg8MO2mvIiE", "node_uuid": "00000000-0000-0000-0000-7cc2554e4930", "node_address": "fdae:41e4:649b:9303:a85c:9d8f:37eb:ba2f/64"}
05/07/2025 04:46:19
[talos] service[containerd](Running): Health check successful
05/07/2025 04:46:20
[talos] service[cri](Running): Health check successful
05/07/2025 04:46:20
[talos] service[kubelet](Running): Health check successful 

Environment

  • Talos version: 1.10.4 (kernel 6.12.31-talos)
  • Kubernetes version: 1.33.2
  • Platform: bare-metal

bernardgut (Jul 05 '25 09:07)

This looks like resource exhaustion, e.g. all CPU being consumed by some workload. I would recommend installing monitoring.

smira (Jul 07 '25 10:07)
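
For a quick look at CPU and memory pressure without deploying a full monitoring stack, Talos ships a built-in dashboard, and kubectl can report per-node usage once metrics-server is installed. A minimal sketch; <node-ip> is a placeholder:

  # live CPU/memory/process view served by the node itself
  talosctl -n <node-ip> dashboard

  # cluster-wide usage, assuming metrics-server is deployed
  kubectl top nodes
  kubectl top pods -A --sort-by=cpu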

I do have monitoring and I was checking that. I will check more carefully the next time it happens. I don't have any CPU-intensive tasks (that I am aware of). I will report back the next time this happens.

bernardgut (Jul 08 '25 09:07)
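
If it recurs, capturing a support bundle right after the node recovers would preserve the relevant logs for later analysis. A sketch, assuming talosctl access; <node-ip> is a placeholder:

  # collect logs, kernel messages, and resource state from the node into support.zip
  talosctl -n <node-ip> support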

I think we're stumbling upon the same issue here. No workloads are running in the cluster at all.

koslib (Nov 07 '25 09:11)