static pods don't recover after containerd SIGHUP
Bug Report
Description
Since updating to 1.10.1, one of my nodes has etcd flapping on boot. etcd eventually settles, but machined does not seem to notice and never starts the static pods, so the node has no API server, scheduler, etc.
Logs
kantai1: user: warning: [2025-05-09T20:02:40.416854359Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
kantai1: user: warning: [2025-05-09T20:02:40.416881359Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 255
kantai1: user: warning: [2025-05-09T20:02:40.417083359Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
kantai1: user: warning: [2025-05-09T20:02:40.417105359Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
kantai1: user: warning: [2025-05-09T20:02:40.417119359Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
kantai1: user: warning: [2025-05-09T20:02:40.417226359Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: hangup
kantai1: user: warning: [2025-05-09T20:02:45.418257359Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-05-09T20:02:45.418472359Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-05-09T20:02:45.450909359Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 63467
kantai1: user: warning: [2025-05-09T20:02:49.782320359Z]: [talos] service[cri](Running): Health check successful
kantai1: user: warning: [2025-05-09T20:02:50.419485359Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-05-09T20:02:52.651691359Z]: [talos] service[kubelet](Running): Started task kubelet (PID 68195) for container kubelet
Environment
- Talos version: v1.10.1
- Kubernetes version: v1.33.0
- Platform: baremetal
This is a support bundle from rebooting the same node, where it eventually settled OK. Same pattern: etcd and/or containerd restart a few times, but in the end the static pods were rendered and the node reached a good steady state.
Do you have any addons on Talos that might send SIGHUP? I don't think Talos ever does.
Hey, I have the same error. It seems random to me when this happens. To solve the problem, we usually had to restart Talos; only very rarely did the static pods come up again on their own.
Logs
ml-training-server: user: warning: [2025-05-14T13:57:31.924363305Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 255
ml-training-server: user: warning: [2025-05-14T13:57:31.924852305Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
ml-training-server: user: warning: [2025-05-14T13:57:31.925100305Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: segmentation fault
ml-training-server: user: warning: [2025-05-14T13:57:31.926124305Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
ml-training-server: user: warning: [2025-05-14T13:57:31.926171305Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
ml-training-server: user: warning: [2025-05-14T13:57:31.926210305Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
ml-training-server: user: warning: [2025-05-14T13:57:36.927291305Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
ml-training-server: user: warning: [2025-05-14T13:57:36.927439305Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
ml-training-server: user: warning: [2025-05-14T13:57:36.964435305Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 257669
ml-training-server: user: warning: [2025-05-14T13:57:41.928862305Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
ml-training-server: user: warning: [2025-05-14T13:57:44.166169305Z]: [talos] service[kubelet](Running): Started task kubelet (PID 259506) for container kubelet
ml-training-server: user: warning: [2025-05-14T13:57:45.197204305Z]: [talos] service[cri](Running): Health check successful
Extensions:
- qemu-guest-agent
- nvidia-open-gpu-kernel-modules-production
- nvidia-container-toolkit-production
Environment
- Talos version: v1.10.0
- Kubernetes version: v1.33.0
- Platform: proxmox (8.4.1)
@samuelkees your issue is different and was fixed in 1.10.1; please let's not mix everything into one issue, thank you.
@smira Thanks, didn't see the difference. I will test the new version.
> Do you have any addons on Talos that might send SIGHUP? I don't think Talos ever does.
OK, this is a bit embarrassing. This issue is just a continuation of #9271. I thought I had disabled nvidia container toolkit's installer's restart runtime logic, but the code has changed and it wasn't being disabled anymore.
You can keep this open if you want to harden Talos against this.
I think we can only protect against it via SELinux, @dsseng?
So the problem is that a runtime run by containerd sends SIGHUP to containerd? That might not be easy to mitigate, because we don't transition domains at runtime execution, but rather when the pod processes start.
I think it's the NVIDIA container toolkit, so it should be a pod started by containerd that sends SIGHUP to containerd.
> I think it's the NVIDIA container toolkit, so it should be a pod started by containerd that sends SIGHUP to containerd.
Ah, okay, that would probably be blocked under the policy, as pods may only signal other pods.
I do have SELinux disabled at the moment.
Possibly, if it makes sense, SIGHUP could be masked by containerd's parent process (machined?); execve does not reset ignored signals. Though if containerd then reprograms its signal mask, that won't help.
I need to check, but I believe the NVIDIA toolkit management daemonset pod is privileged. Would it still be blocked by the SELinux policy?
> I need to check, but I believe the NVIDIA toolkit management daemonset pod is privileged. Would it still be blocked by the SELinux policy?
If Talos SELinux is enforced (enforce=1), it would protect Talos core (containerd) from any workloads.
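For reference, a minimal sketch of turning on enforcing mode via the machine config, assuming the `enforcing=1` kernel argument described in the Talos SELinux docs (verify against the docs for your release); kernel arguments only take effect after an install or upgrade:

```yaml
# Hedged sketch, not a verified config: enable SELinux enforcing mode,
# assuming the `enforcing=1` kernel argument documented for Talos 1.10+.
# Extra kernel args are applied on install/upgrade, not on a plain reboot.
machine:
  install:
    extraKernelArgs:
      - enforcing=1
```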
I am also experiencing this! I've been struggling to get my etcd cluster back to a healthy state. I wiped a node to upgrade the Kubernetes version, and even after a few more wipes I can't get it to join the etcd cluster. I am not running the NVIDIA container toolkit on these nodes.
@VirtualDisk please open a separate issue; it doesn't help to comment on other issues without providing any logs.
Also, please keep in mind there's a known issue in 1.10.0 that was fixed in 1.10.1.
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This is still an issue.
The NVIDIA GPU operator (which I am still working on for Talos in my personal time) schedules an NVIDIA Container Toolkit daemonset on every node. This pod can do a number of things (driver installation and CDI spec generation are the main ones).
For this bug, the key action is configuring containerd to add the NVIDIA runtimes. As part of this process, with the latest version, the CTK emits a containerd drop-in that can be imported by the main Talos-managed containerd config (it also patches the existing top-level config to add an import statement -- this won't work on Talos but can be ignored/bypassed). After it does that, it wants to signal containerd to force a config reload so the runtimes become available.
This only really matters the first time containerd is configured on a node, when you update the GPU operator or CTK, or when you change the operator configuration. You can configure the operator to disable the signal, but then nodes that do need a containerd restart won't make progress (w.r.t. GPU enablement) without manual intervention.
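For illustration, a hedged sketch of how that opt-out might be expressed through the GPU operator's ClusterPolicy. `spec.toolkit.env` is a real ClusterPolicy field, but the `RESTART_MODE` name and `none` value are assumptions here; check the toolkit container's documented environment variables before relying on them:

```yaml
# Hypothetical sketch only: disabling the containerd restart/signal in the
# GPU operator. toolkit.env exists in ClusterPolicy, but RESTART_MODE/none
# is an assumed knob -- confirm the actual variable in the CTK docs.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  toolkit:
    env:
      - name: RESTART_MODE   # assumption: name of the restart-behaviour knob
        value: "none"
```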
There were several fixes to Talos related to restarts, so it might have been fixed already in 1.12.
But I don't think Talos 1.12 would allow anything to change containerd config for security reasons.
> There were several fixes to Talos related to restarts, so it might have been fixed already in 1.12.
I'll give it a try on final 1.12 (don't have time to deploy betas right now).
> But I don't think Talos 1.12 would allow anything to change containerd config for security reasons.
It's more of an opt-in by the cluster admin: add an imports statement to /etc/cri/conf.d/20-customization.part via the machineconfig, pointing at the path where the toolkit installer has been configured to output its drop-in config file. The only missing part is letting the CTK installer restart containerd. But that's full of gremlins in and of itself.
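A minimal sketch of that opt-in, assuming the documented `machine.files` / `/etc/cri/conf.d/20-customization.part` mechanism that Talos merges into the CRI containerd config; the import path is a placeholder for wherever the CTK installer is configured to write its drop-in:

```yaml
# Hedged sketch of the admin opt-in described above. The .part file is
# merged by Talos into /etc/cri/containerd.toml; the glob below is a
# placeholder path, not the toolkit's actual default output location.
machine:
  files:
    - path: /etc/cri/conf.d/20-customization.part
      op: create
      permissions: 0o644
      content: |
        imports = ["/var/nvidia/containerd/conf.d/*.toml"]
```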
Taking a step back, one question to ask is "should a Talos node survive containerd exiting for any reason". Could be a crash, could be config hot-reload via a signal, etc. I think that should be "yes" and properly tested/engineered. Then we could look at specific use cases.
> Taking a step back, one question to ask is "should a Talos node survive containerd exiting for any reason". Could be a crash, could be config hot-reload via a signal, etc. I think that should be "yes" and properly tested/engineered. Then we could look at specific use cases.
Yes, Talos should definitely handle it; if it's not the case, we would consider it a bug.
There were a bunch of issues related to this process fixed in 1.12, but if there's anything else, we're happy to look into it.