
kubelet container image "not found" does not retry

Open Ulexus opened this issue 8 months ago • 9 comments

Bug Report

New with v1.10:

The kubelet service attempts to start, but it uses the incorrect DNS server (8.8.8.8), which fails to resolve the registry host. This occurs long after the machineconfig with the correct DNS servers has been applied and the machine is up and otherwise operational. After this failure, kubelet never comes up. Manually restarting the kubelet service via talosctl service kubelet restart does allow the kubelet to come up.

When running v1.9.6, this works (logs attached at the bottom).

Description

From console logs:

  • Containerd is started before the correct DNS server is loaded: line 17
  • Correct DNS server is loaded: line 97
  • Containerd resolves using incorrect DNS server: line 201

Effect: kubelet never starts because resolution of its image fails.

Manually restarting the kubelet service causes it to come up.

Logs

Full logs attached; trimmed logs below:

user: warning: [2025-05-19T23:46:35.229803076Z]: [talos] service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started with PID 1127
user: warning: [2025-05-19T23:46:35.969378076Z]: [talos] service[containerd](Running): Health check successful
user: warning: [2025-05-19T23:46:37.719428076Z]: [talos] setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["169.254.169.253", "fd00:ec2::253"], "searchDomains": ["ec2.internal"]}
user: warning: [2025-05-19T23:46:38.761669076Z]: [talos] service[cri](Starting): Starting service
user: warning: [2025-05-19T23:46:38.850301076Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 2035
user: warning: [2025-05-19T23:46:39.152175076Z]: [talos] service[kubelet](Starting): Starting service
user: warning: [2025-05-19T23:46:39.152860076Z]: [talos] service[kubelet](Waiting): Waiting for volume "/var/lib" to be mounted, volume "/var/lib/kubelet" to be mounted, volume "/var/log" to be mounted, volume "/var/log/audit" to be mounted, volume "/var/log/containers" to be mounted, volume "/var/log/pods" to be mounted, volume "/var/lib/kubelet/seccomp" to be mounted, volume "/var/lib/kubelet/seccomp/profiles" to be mounted, volume "/var/log/audit/kube" to be mounted, volume "/var/mnt" to be mounted, service "cri" to be "up", time sync, network
user: warning: [2025-05-19T23:46:39.175239076Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "", "table": "RoutingTable(180)", "link": "kubespan", "priority": 1, "family": "inet4"}
user: warning: [2025-05-19T23:46:39.177365076Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "", "table": "RoutingTable(180)", "link": "kubespan", "priority": 1, "family": "inet6"}
user: warning: [2025-05-19T23:46:39.794763076Z]: [talos] service[cri](Running): Health check successful
user: warning: [2025-05-19T23:46:39.795498076Z]: [talos] service[kubelet](Preparing): Running pre state
user: warning: [2025-05-19T23:46:53.665193076Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "kubelet" to be "up"
user: warning: [2025-05-19T23:46:59.715580076Z]: level=info msg=fetch failed error=failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": dial tcp: lookup ghcr.io on 8.8.8.8:53: read udp 10.224.128.22:59060->8.8.8.8:53: i/o timeout host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb
user: warning: [2025-05-19T23:46:59.779107076Z]: level=info msg=fetch failed after status: 404 Not Found host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb
user: warning: [2025-05-19T23:46:59.787276076Z]: [talos] service[kubelet](Failed): Failed to run pre stage: 1 error(s) occurred:
user: warning: [2025-05-19T23:46:59.790284076Z]: failed to pull image "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb: not found

console-log.txt

As seen from services:

$ talosctl services
NODE                                     SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   apid         Running   OK       4m5s ago      Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   auditd       Running   OK       4m31s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   containerd   Running   OK       4m31s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   cri          Running   OK       4m28s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   dashboard    Running   ?        4m30s ago     Process Process(["/sbin/dashboard"]) started with PID 1949
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   kubelet      Failed    ?        4m7s ago      Failed to run pre stage: 1 error(s) occurred:
                                         failed to pull image "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb: not found
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   machined     Running   OK       4m31s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   syslogd      Running   OK       4m30s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   udevd        Running   OK       4m32s ago     Health check successful
$ talosctl service kubelet restart
NODE                                     RESPONSE
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   Service "kubelet" restarted
$ talosctl services
NODE                                     SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   apid         Running   OK       6m57s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   auditd       Running   OK       7m24s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   containerd   Running   OK       7m24s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   cri          Running   OK       7m20s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   dashboard    Running   ?        7m22s ago     Process Process(["/sbin/dashboard"]) started with PID 1949
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   kubelet      Running   ?        1s ago        Started task kubelet (PID 2159) for container kubelet
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   machined     Running   OK       7m24s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   syslogd      Running   OK       7m23s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   udevd        Running   OK       7m24s ago     Health check successful

For comparison, here are the logs from a new, working v1.9.6 node in the same cluster.

working-console-log.txt

Environment

  • Talos version: v1.10.1 (though also fails in v1.10.0)
  • Kubernetes version: 1.33.0
  • Platform: various x86 on AWS

Ulexus, May 20 '25 00:05

I don't think the analysis is correct here; there are two pull errors:

  • a failure to do the DNS lookup, which is retried
  • an HTTP 404 from the registry, which is not retried, as it is a terminal error (the image is not found)

So I don't see a bug so far. The new resolver settings are correctly picked up after a retry.
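
For illustration, a rough Go sketch of this distinction (not the actual Talos code; the isTerminalPullError helper is hypothetical, and the assumption is only that a registry 404 surfaces through the containerd errdefs package) would look something like:

package pullcheck

import (
	"errors"
	"net"

	"github.com/containerd/errdefs"
)

// isTerminalPullError reports whether a pull error should stop further retries.
// A registry 404 surfaces as errdefs.ErrNotFound and is treated as terminal;
// a DNS failure (like "lookup ghcr.io on 8.8.8.8:53: i/o timeout") is not.
func isTerminalPullError(err error) bool {
	if errdefs.IsNotFound(err) {
		return true // the image reference is taken to be wrong or genuinely missing
	}

	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return false // transient: retried, and new resolver settings get picked up
	}

	return false // everything else is treated as transient in this sketch
}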

smira, May 20 '25 10:05

Yeah, it looks like this is coming from containerd, not Talos. I haven't completely run through that morass, but I'm concerned that, at least at a high level, they are using the NotFound term to include "not resolved"... but I haven't found code indicating that this is what is happening; rather, it appears to be as you indicate, and it's just the logs that are confusing.

Outside of that, though, I think the premise is wrong: just because an image is not found at one point in time does not mean it will never be found. That is, while a 404 error should be cause to cancel immediate retries, as it is semantically a terminal error, it should not constitute a forever condition.

Ulexus, May 20 '25 12:05

Talos is not completely stuck there - e.g. changing the image reference to a correct one would cause a new attempt - but I don't think Talos should retry on its own in this condition.

smira, May 20 '25 12:05

No, it's not stuck in the sense that an external event (such as the restart of kubelet described above) is sufficient to get it unstuck... but this doesn't facilitate automation.

The issue with image availability is that it is not fixed over time. It is entirely reasonable to presume that an image which is unavailable now is simply not available yet. This does raise the question of how often and under what conditions Talos should retry, but I would suggest doing the same thing Kubernetes does in this situation: a bounded exponential backoff.

In this example, presuming all the current assumptions are true, it is the fault of the ghcr.io server for misrepresenting the failure as a 404, and a simple retry would have resolved the problem. A more common case, though, would be an image that is not yet available for some reason (a slow mirror, a new service which hasn't yet synced, an out-of-order or non-deterministically-ordered deployment, etc.). In all of these cases, a bounded exponential backoff retry strategy would automatically resolve the issue without causing undue load on the container registry, even if the image never appears.
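
Roughly, I'd imagine something like the following sketch (purely illustrative; the pull callback and the delay constants are hypothetical, not references to any actual Talos code):

package pullretry

import (
	"context"
	"time"
)

// pullWithBackoff keeps retrying a failed pull with an exponential backoff
// capped at maxDelay, so an image that shows up late is eventually pulled
// without hammering the registry, even if it never appears.
func pullWithBackoff(ctx context.Context, pull func(context.Context) error) error {
	const (
		initialDelay = 5 * time.Second
		maxDelay     = 5 * time.Minute
	)

	delay := initialDelay

	for {
		if err := pull(ctx); err == nil {
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err() // stop when the service is being torn down
		case <-time.After(delay):
		}

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}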

Ulexus, May 20 '25 16:05

Still running into this issue:

user: warning: [2025-06-10T01:47:52.695925092Z]: level=info msg=fetch failed error=failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe": dial tcp: lookup ghcr.io on 8.8.8.8:53: read udp 10.128.152.4:41986->8.8.8.8:53: i/o timeout host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe                                                                                                                                                                         
user: warning: [2025-06-10T01:47:52.752588092Z]: level=info msg=fetch failed after status: 404 Not Found host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe                                                                                         
user: warning: [2025-06-10T01:47:52.756024092Z]: [talos] service[kubelet](Failed): Failed to run pre stage: 1 error(s) occurred:                                                                                                                                                                                       
user: warning: [2025-06-10T01:47:52.756636092Z]: failed to pull image "ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe": ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe: not found

FYI I worked around it for this machine by running:

talosctl image pull --namespace system ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe

But the workaround is a non-starter for dynamically created nodes.

@smira surely the pull should be retried when the correct DNS records are picked up?

james-callahan, Jun 10 '25 02:06

@smira surely the pull should be retried when the correct DNS records are picked up?

Also, note that you get a not found error, not a DNS error. That is a fatal/terminal error; the DNS error is retried.

smira, Jun 10 '25 09:06

Also, note that you get a not found error, not a DNS error. That is a fatal/terminal error; the DNS error is retried.

I don't think a 404 is actually being returned by the ghcr.io server, so it must be being synthesised by something in Talos/containerd.

Digging in a little, I think the info log is coming from https://github.com/containerd/containerd/blob/cb1076646aa3740577fafbf3d914198b7fe8e3f7/core/remotes/docker/resolver.go#L330

I note that there are several paths that set the result to errdefs.ErrNotFound

Is there a way to view debug level logs for containerd? (e.g. hoping to see https://github.com/containerd/containerd/blob/cb1076646aa3740577fafbf3d914198b7fe8e3f7/core/remotes/docker/resolver.go#L309C13-L309C15)

james-callahan, Jun 10 '25 12:06

This is not coming from the containerd process, but the Not Found error is still returned when the image is not present.

It might have been an issue with ghcr.io, but please let's keep that discussion in another issue or a separate discussion.

smira, Jun 10 '25 12:06

This still seems relevant to the issue's title: whatever the cause of the not found condition, it should not be treated as a permanent error beyond a single download attempt. Even if this is coming from GHCR and is correct at the time of the request, it is the nature of asynchronous systems for images to become available with a delay. (In these cases that is not actually what happened, because the images were independently verified to exist long before the download attempt... but I'm trying to illustrate the fundamental fallacy of the current strategy.)

Ulexus, Jun 10 '25 16:06

I can't reproduce this issue, and I don't have an explanation for the HTTP 404, but I don't see how it can report a 404 without actually getting an HTTP 404.

Also, this error never appears in our test environment, so I think there is something else in your config/environment which triggers it.

Nevertheless, we will add limited retries for 404s.
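
As a purely illustrative sketch of what limited retries could look like (not the actual change; the pull callback, the retryDelay, and the maxNotFound cap are hypothetical), transient errors would keep being retried as before, while a not found result would be retried only a few times before being treated as terminal:

package pullretry

import (
	"context"
	"fmt"
	"time"

	"github.com/containerd/errdefs"
)

// pullWithLimitedNotFoundRetries retries transient failures indefinitely but
// gives up after a few not found responses, treating those as terminal.
func pullWithLimitedNotFoundRetries(ctx context.Context, pull func(context.Context) error) error {
	const (
		maxNotFound = 3                // hypothetical cap, for illustration only
		retryDelay  = 10 * time.Second
	)

	notFoundSeen := 0

	for {
		err := pull(ctx)
		if err == nil {
			return nil
		}

		if errdefs.IsNotFound(err) {
			notFoundSeen++
			if notFoundSeen >= maxNotFound {
				return fmt.Errorf("image still not found after %d attempts: %w", notFoundSeen, err)
			}
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(retryDelay):
		}
	}
}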

smira, Jul 02 '25 12:07