kubelet container image "not found" does not retry
Bug Report
New with v1.10:
The kubelet service attempts to start, but it uses the wrong DNS server (8.8.8.8), so name resolution fails. This happens long after the machine config with the correct DNS servers has been applied and the node is otherwise operational. After this failure, kubelet never comes up.
Manually restarting the kubelet service via talosctl service kubelet restart does allow the kubelet to come up.
When running v1.9.6, this works (logs attached at the bottom).
Description
From console logs:
- Containerd is started before the correct DNS server is loaded: line 17
- Correct DNS server is loaded: line 97
- Containerd resolves using incorrect DNS server: line 201
Effect: kubelet never starts, because its image pull cannot resolve the registry hostname.
Manually restarting the kubelet service brings it up.
Logs
Full logs attached; trimmed logs below:
user: warning: [2025-05-19T23:46:35.229803076Z]: [talos] service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started with PID 1127
user: warning: [2025-05-19T23:46:35.969378076Z]: [talos] service[containerd](Running): Health check successful
user: warning: [2025-05-19T23:46:37.719428076Z]: [talos] setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["169.254.169.253", "fd00:ec2::253"], "searchDomains": ["ec2.internal"]}
user: warning: [2025-05-19T23:46:38.761669076Z]: [talos] service[cri](Starting): Starting service
user: warning: [2025-05-19T23:46:38.850301076Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 2035
user: warning: [2025-05-19T23:46:39.152175076Z]: [talos] service[kubelet](Starting): Starting service
user: warning: [2025-05-19T23:46:39.152860076Z]: [talos] service[kubelet](Waiting): Waiting for volume "/var/lib" to be mounted, volume "/var/lib/kubelet" to be mounted, volume "/var/log" to be mounted, volume "/var/log/audit" to be mounted, volume "/var/log/containers" to be mounted, volume "/var/log/pods" to be mounted, volume "/var/lib/kubelet/seccomp" to be mounted, volume "/var/lib/kubelet/seccomp/profiles" to be mounted, volume "/var/log/audit/kube" to be mounted, volume "/var/mnt" to be mounted, service "cri" to be "up", time sync, network
user: warning: [2025-05-19T23:46:39.175239076Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "", "table": "RoutingTable(180)", "link": "kubespan", "priority": 1, "family": "inet4"}
user: warning: [2025-05-19T23:46:39.177365076Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "", "table": "RoutingTable(180)", "link": "kubespan", "priority": 1, "family": "inet6"}
user: warning: [2025-05-19T23:46:39.794763076Z]: [talos] service[cri](Running): Health check successful
user: warning: [2025-05-19T23:46:39.795498076Z]: [talos] service[kubelet](Preparing): Running pre state
user: warning: [2025-05-19T23:46:53.665193076Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "kubelet" to be "up"
user: warning: [2025-05-19T23:46:59.715580076Z]: level=info msg=fetch failed error=failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": dial tcp: lookup ghcr.io on 8.8.8.8:53: read udp 10.224.128.22:59060->8.8.8.8:53: i/o timeout host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb
user: warning: [2025-05-19T23:46:59.779107076Z]: level=info msg=fetch failed after status: 404 Not Found host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb
user: warning: [2025-05-19T23:46:59.787276076Z]: [talos] service[kubelet](Failed): Failed to run pre stage: 1 error(s) occurred:
user: warning: [2025-05-19T23:46:59.790284076Z]: failed to pull image "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb: not found
As seen from services:
$ talosctl services
NODE SERVICE STATE HEALTH LAST CHANGE LAST EVENT
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff apid Running OK 4m5s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff auditd Running OK 4m31s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff containerd Running OK 4m31s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff cri Running OK 4m28s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff dashboard Running ? 4m30s ago Process Process(["/sbin/dashboard"]) started with PID 1949
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff kubelet Failed ? 4m7s ago Failed to run pre stage: 1 error(s) occurred:
failed to pull image "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb: not found
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff machined Running OK 4m31s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff syslogd Running OK 4m30s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff udevd Running OK 4m32s ago Health check successful
$ talosctl service kubelet restart
NODE RESPONSE
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff Service "kubelet" restarted
$ talosctl services
NODE SERVICE STATE HEALTH LAST CHANGE LAST EVENT
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff apid Running OK 6m57s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff auditd Running OK 7m24s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff containerd Running OK 7m24s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff cri Running OK 7m20s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff dashboard Running ? 7m22s ago Process Process(["/sbin/dashboard"]) started with PID 1949
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff kubelet Running ? 1s ago Started task kubelet (PID 2159) for container kubelet
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff machined Running OK 7m24s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff syslogd Running OK 7m23s ago Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff udevd Running OK 7m24s ago Health check successful
For comparison, here are the logs from a new, working v1.9.6 node in the same cluster.
Environment
- Talos version: v1.10.1 (though also fails in v1.10.0)
- Kubernetes version: 1.33.0
- Platform: various x86 on AWS
I don't think the analysis is correct here; there are two pull errors:
- a failure to do the DNS lookup, which is retried
- an HTTP 404 from the registry, which is not retried, as it's a terminal error (the image is not found)
So I don't see a bug so far. New resolver settings are correctly picked up after a retry.
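To make the distinction concrete, here is a rough sketch (not the actual Talos or containerd code) of how a pull path could separate the two error kinds; the classifyPullError helper is hypothetical, and the errdefs import path assumes containerd 2.x:

```go
package pull

import (
	"errors"
	"net"

	"github.com/containerd/errdefs"
)

// classifyPullError (hypothetical) separates transient network failures,
// which are worth retrying, from a genuine NotFound from the registry,
// which is treated as terminal today.
func classifyPullError(err error) (retryable bool) {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return true // DNS lookup failed (e.g. wrong resolver) - transient
	}

	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true // i/o timeout - transient
	}

	if errdefs.IsNotFound(err) {
		return false // registry says the manifest does not exist - terminal
	}

	return false
}
```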
Yeah, it looks like this is coming from containerd, not Talos. I haven't completely run through that morass, but I'm concerned that, at least at a high level, they use the NotFound term to include "not resolved"... I haven't found code showing that is what happens, though; rather, it appears to be as you indicate, and it's just the logs that are confusing.
Outside of that, though, I think the premise is wrong: just because an image is not found at one point in time does not mean it will never be found. That is, while a 404 should be cause to cancel immediate retries, as it is semantically a terminal error, it should not constitute a forever condition.
Talos is not completely stuck there - e.g. changing an image reference to a correct one would cause a new attempt, but I don't think Talos should retry on its own in this condition.
No, it's not stuck, in the sense that an external event (such as the kubelet restart described above) is enough to get it unstuck... but that doesn't facilitate automation.
The issue with image availability is that it is not fixed over time. It is entirely reasonable to presume that an image which is unavailable now is simply not available yet. That does raise the question of how often and under what conditions Talos should retry, but I would suggest doing the same thing Kubernetes does in this situation: a bounded exponential backoff.
In this example, presuming all the current assumptions are true, the ghcr.io server is at fault for misreporting a 404, and a simple retry would have resolved the problem. A more common case, though, would be that the image is not yet available for some reason (a slow mirror, a new service that hasn't yet synced, an out-of-order or non-deterministically-ordered deployment, etc.). In all of these cases, a bounded exponential backoff retry strategy would automatically resolve the issue without putting undue load on the container registry, even if the image never appears.
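As a sketch of what that could look like (illustrative only; pullImage is a hypothetical stand-in for the actual pull call, and the delays are arbitrary):

```go
package pull

import (
	"context"
	"time"
)

// pullWithBackoff retries pullImage with exponentially increasing delays,
// capped at maxDelay, so a registry that is temporarily missing the image
// (or misreporting a 404) gets re-polled without being hammered.
func pullWithBackoff(ctx context.Context, pullImage func(context.Context) error) error {
	const (
		baseDelay = 5 * time.Second
		maxDelay  = 5 * time.Minute
	)

	delay := baseDelay

	for {
		if err := pullImage(ctx); err == nil {
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err() // give up when the service is being torn down
		case <-time.After(delay):
		}

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```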
Still running into this issue:
user: warning: [2025-06-10T01:47:52.695925092Z]: level=info msg=fetch failed error=failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe": dial tcp: lookup ghcr.io on 8.8.8.8:53: read udp 10.128.152.4:41986->8.8.8.8:53: i/o timeout host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe
user: warning: [2025-06-10T01:47:52.752588092Z]: level=info msg=fetch failed after status: 404 Not Found host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe
user: warning: [2025-06-10T01:47:52.756024092Z]: [talos] service[kubelet](Failed): Failed to run pre stage: 1 error(s) occurred:
user: warning: [2025-06-10T01:47:52.756636092Z]: failed to pull image "ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe": ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe: not found
FYI I worked around it for this machine by running:
talosctl image pull --namespace system ghcr.io/siderolabs/kubelet@sha256:5bcfb0bb5643cac7161a6f72fc4f71d33a865f3049b99f3807f1b8d85f916bfe
But the workaround is a non-starter for dynamically created nodes.
@smira surely the pull should be retried when the correct DNS records are picked up?
Plus, note that you get a not found error, not a DNS error. That is a fatal/terminal error; DNS errors are retried.
I don't think a 404 is actually being returned by the ghcr.io server, so it must be synthesised by something in Talos/containerd.
Digging in a little, I think the info log is coming from https://github.com/containerd/containerd/blob/cb1076646aa3740577fafbf3d914198b7fe8e3f7/core/remotes/docker/resolver.go#L330
I note that there are several paths that set the result to errdefs.ErrNotFound.
Is there a way to view debug-level logs for containerd? (e.g. hoping to see https://github.com/containerd/containerd/blob/cb1076646aa3740577fafbf3d914198b7fe8e3f7/core/remotes/docker/resolver.go#L309C13-L309C15)
This is not coming from the containerd process, but the Not Found error is still returned when the image is not present.
It might have been an issue with ghcr.io, but please let's keep that discussion in a separate issue/discussion.
This still seems relevant to the issue's title: whatever the cause of the not found condition, it should not be treated as a permanent error beyond a single download attempt. Even if this is coming from GHCR and is correct at the time of the request, it is in the nature of asynchronous systems for images to become available with a delay. (That isn't what happened here, since the images had been independently verified to exist long before the download attempt... but I'm trying to illustrate the fundamental fallacy of the current strategy.)
I can't reproduce this issue, and I don't have any explanation for the HTTP 404, but I don't see how it could report a 404 without actually receiving an HTTP 404.
Also this error never appears in our test environment, so I think there is something else in your config/environment which triggers it.
Nevertheless, we will add limited retries for 404s.
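For reference, a minimal sketch of what limited retries for terminal not-found errors might look like (a guess at the shape, not the actual change; maxNotFoundRetries and retryDelay are made-up values, and pullImage is again a hypothetical stand-in):

```go
package pull

import (
	"context"
	"time"

	"github.com/containerd/errdefs"
)

// pullWithLimitedNotFoundRetries retries a not-found result a bounded number
// of times before giving up, while leaving other errors to the caller's
// existing retry logic.
func pullWithLimitedNotFoundRetries(ctx context.Context, pullImage func(context.Context) error) error {
	const (
		maxNotFoundRetries = 3                // made-up bound
		retryDelay         = 10 * time.Second // made-up delay
	)

	var err error

	for attempt := 0; attempt <= maxNotFoundRetries; attempt++ {
		if err = pullImage(ctx); err == nil {
			return nil
		}

		if !errdefs.IsNotFound(err) {
			return err // other errors keep their existing handling
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(retryDelay):
		}
	}

	return err
}
```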