
static pods don't recover after containerd SIGHUP

Open jfroy opened this issue 6 months ago • 13 comments

Bug Report

Description

Since updating to 1.10.1, one of my nodes has etcd flapping on boot. etcd eventually seems to settle, but machined does not seem to notice and never starts the static pods, so the node has no API server, scheduler, etc.

Logs

kantai1: user: warning: [2025-05-09T20:02:40.416854359Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
kantai1: user: warning: [2025-05-09T20:02:40.416881359Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 255
kantai1: user: warning: [2025-05-09T20:02:40.417083359Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
kantai1: user: warning: [2025-05-09T20:02:40.417105359Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
kantai1: user: warning: [2025-05-09T20:02:40.417119359Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
kantai1: user: warning: [2025-05-09T20:02:40.417226359Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: hangup
kantai1: user: warning: [2025-05-09T20:02:45.418257359Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-05-09T20:02:45.418472359Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-05-09T20:02:45.450909359Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 63467
kantai1: user: warning: [2025-05-09T20:02:49.782320359Z]: [talos] service[cri](Running): Health check successful
kantai1: user: warning: [2025-05-09T20:02:50.419485359Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
kantai1: user: warning: [2025-05-09T20:02:52.651691359Z]: [talos] service[kubelet](Running): Started task kubelet (PID 68195) for container kubelet

Environment

  • Talos version: v1.10.1
  • Kubernetes version: v1.33.0
  • Platform: baremetal

support.zip

jfroy avatar May 09 '25 20:05 jfroy

This is a support bundle from a reboot of the same node, where it eventually settled OK. Same pattern: etcd and/or containerd restarted a few times, but in the end the static pods were rendered and the node reached a good steady state.

support-ok-boot.zip

jfroy avatar May 10 '25 04:05 jfroy

Do you have any addons on Talos that might send SIGHUP? I don't think Talos ever does.

smira avatar May 13 '25 12:05 smira

Hey, I have the same error. It seems random when it happens. To solve the problem, we usually have to restart Talos. Very rarely, the static pods come up again on their own.

Logs

ml-training-server: user: warning: [2025-05-14T13:57:31.924363305Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 255
ml-training-server: user: warning: [2025-05-14T13:57:31.924852305Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
ml-training-server: user: warning: [2025-05-14T13:57:31.925100305Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: segmentation fault
ml-training-server: user: warning: [2025-05-14T13:57:31.926124305Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
ml-training-server: user: warning: [2025-05-14T13:57:31.926171305Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
ml-training-server: user: warning: [2025-05-14T13:57:31.926210305Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
ml-training-server: user: warning: [2025-05-14T13:57:36.927291305Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
ml-training-server: user: warning: [2025-05-14T13:57:36.927439305Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
ml-training-server: user: warning: [2025-05-14T13:57:36.964435305Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 257669
ml-training-server: user: warning: [2025-05-14T13:57:41.928862305Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
ml-training-server: user: warning: [2025-05-14T13:57:44.166169305Z]: [talos] service[kubelet](Running): Started task kubelet (PID 259506) for container kubelet
ml-training-server: user: warning: [2025-05-14T13:57:45.197204305Z]: [talos] service[cri](Running): Health check successful

Extensions:

  • qemu-guest-agent
  • nvidia-open-gpu-kernel-modules-production
  • nvidia-container-toolkit-production

Environment

  • Talos version: v1.10.0
  • Kubernetes version: v1.33.0
  • Platform: proxmox (8.4.1)

samuelkees avatar May 14 '25 14:05 samuelkees

@samuelkees your issue is different, and it was fixed in 1.10.1; let's please not mix everything into one issue, thank you.

smira avatar May 14 '25 14:05 smira

@smira Thanks, didn't see the difference. I will test the new version.

samuelkees avatar May 14 '25 14:05 samuelkees

Do you have any addons on Talos that might send SIGHUP? I don't think Talos ever does.

OK, this is a bit embarrassing. This issue is just a continuation of #9271. I thought I had disabled the NVIDIA Container Toolkit installer's restart-runtime logic, but the code has changed and it was no longer being disabled.

You can keep this open if you want to harden Talos against this.

jfroy avatar May 14 '25 20:05 jfroy

I think we can only protect against it via SELinux, @dsseng?

smira avatar May 15 '25 09:05 smira

So the problem is that a runtime run by containerd sends SIGHUP to containerd? That might not be easy to mitigate, because we don't transition at runtime execution but rather when the pod processes start.

dsseng avatar May 15 '25 10:05 dsseng

I think it's the NVIDIA Container Toolkit, so it should be a pod started by containerd sending SIGHUP to containerd.

smira avatar May 15 '25 10:05 smira

I think it's the NVIDIA Container Toolkit, so it should be a pod started by containerd sending SIGHUP to containerd.

Ah, okay, that would probably be blocked under the policy, since pods may only signal other pods.

dsseng avatar May 15 '25 10:05 dsseng

I do have SELinux disabled at the moment.

Possibly, if it makes sense, SIGHUP could be masked by containerd's parent process (machined?), since execve does not reset ignored signals. Though if containerd then programs its own signal dispositions, that won't help. A minimal sketch of the idea is below.
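
A minimal sketch in Go (not the actual machined code), assuming the supervisor simply sets SIGHUP to ignored before exec'ing containerd and relies on execve preserving ignored dispositions; the containerd path and arguments are copied from the logs above:

package main

import (
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Set SIGHUP to "ignored" in this process. execve(2) preserves ignored
    // signal dispositions, so the exec'd child starts with SIGHUP ignored
    // too -- unless it later installs its own handler, in which case this
    // masking has no effect.
    signal.Ignore(syscall.SIGHUP)

    // Replace this process with containerd (arguments as in the logs above).
    args := []string{
        "/bin/containerd",
        "--address", "/run/containerd/containerd.sock",
        "--config", "/etc/cri/containerd.toml",
    }
    if err := syscall.Exec(args[0], args, os.Environ()); err != nil {
        panic(err)
    }
}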

I need to check, but I believe the NVIDIA toolkit management daemonset pod is privileged. Would it still be blocked by the SELinux policy?

jfroy avatar May 15 '25 14:05 jfroy

I need to check, but I believe the NVIDIA toolkit management daemonset pod is privileged. Would it still be blocked by the SELinux policy?

If Talos SELinux is enforced (enforce=1), it would protect Talos core (containerd) from any workloads.

smira avatar May 15 '25 14:05 smira

I am also experiencing this! I've been struggling to get my etcd cluster back to a healthy state. I wiped a node to upgrade the k8s version, and even after a few more wipes I can't get it to join the etcd cluster. I am not running the nvidia container toolkit on these nodes.

@VirtualDisk please open a specific issue, it doesn't help to comment on other issues without providing any logs.

Also please keep in mind there's a known issue in 1.10.0 that was fixed in 1.10.1

smira avatar May 16 '25 08:05 smira

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 14 '25 02:11 github-actions[bot]

This is still an issue.

The NVIDIA GPU operator (which I am still working on for Talos in my personal time) schedules an NVIDIA Container Toolkit daemonset on every node. This pod can do a number of things (driver installation and CDI spec generation are the main ones).

For this bug, the key action is configuring containerd to add the NVIDIA runtimes. As part of this process, with the latest version, the CTK will emit a containerd drop-in that can be imported by the main Talos-managed containerd config (it also patches the existing top-level config to add an import statement -- this won't work on Talos but can be ignored/bypassed). After it does that, it wants to signal containerd to force a config reload and make the runtimes available.
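
For reference, such a drop-in is roughly this shape (a sketch, not verbatim CTK output; the plugin section names depend on the containerd config version -- containerd 2.x renames the CRI plugin -- and the runtime binary path depends on where the toolkit is installed):

version = 2

# Register an additional "nvidia" runtime handler with the CRI plugin.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"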

This only really matters the first time containerd is configured on a node, if you update the GPU operator or CTK, or change the operator configuration. You can configure the operator to disable the signal, but then nodes that do need to restart containerd won't make progress (wrt GPU enablement) until manual intervention.

jfroy avatar Nov 14 '25 03:11 jfroy

There were several fixes to Talos related to restarts, so it might have been fixed already in 1.12.

But I don't think Talos 1.12 would allow anything to change containerd config for security reasons.

smira avatar Nov 14 '25 07:11 smira

There were several fixes to Talos related to restarts, so it might have been fixed already in 1.12.

I'll give it a try on final 1.12 (don't have time to deploy betas right now).

But I don't think Talos 1.12 would allow anything to change containerd config for security reasons.

It's more of an opt-in by the cluster admin: add an imports statement to /etc/cri/conf.d/20-customization.part via the machine config, pointing at the path where the toolkit installer has been configured to output its drop-in config file (see the sketch below). The only missing part is letting the CTK installer restart containerd, but that is full of gremlins in and of itself.
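
A rough sketch of that opt-in as a machine config patch; the drop-in output path here is hypothetical and must match wherever the CTK installer is configured to write (a host path writable from the pod):

machine:
  files:
    - path: /etc/cri/conf.d/20-customization.part
      op: create
      permissions: 0o644
      content: |
        # Hypothetical path: point this at the CTK installer's drop-in output file.
        imports = ["/var/nvidia/containerd/nvidia-runtime.toml"]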

Taking a step back, one question to ask is "should a Talos node survive containerd exiting for any reason". Could be a crash, could be config hot-reload via a signal, etc. I think that should be "yes" and properly tested/engineered. Then we could look at specific use cases.

jfroy avatar Nov 14 '25 18:11 jfroy

Taking a step back, one question to ask is "should a Talos node survive containerd exiting for any reason". Could be a crash, could be config hot-reload via a signal, etc. I think that should be "yes" and properly tested/engineered. Then we could look at specific use cases.

Yes, Talos should definitely handle it; if it's not the case, we would consider it a bug.

There were a bunch of issues related to this process fixed in 1.12, but if there's anything else, we're happy to look into it.

smira avatar Nov 15 '25 10:11 smira