Image pull problems on Bottlerocket OS 1.49.0 (aws-k8s-1.34)
Image I'm using:
Bottlerocket OS 1.49.0 (aws-k8s-1.34)
What I expected to happen:
Pod starts
What actually happened:
When we upgraded our cluster to EKS 1.34 and Bottlerocket OS 1.49.0 (aws-k8s-1.34), image pulls sometimes hang and container / pod creation gets stuck. In "sheltie", I ran "ctr -n k8s.io image pull" for the problematic image and saw that the pull of one image layer was stuck partway through; the progress bar was not moving. containerd has a setting, image_pull_progress_timeout, which is not used in Bottlerocket and cannot be set by the user. This setting has the following effect (from the containerd sources):
// ImagePullProgressTimeout is the maximum duration that there is no
// image data read from image registry in the open connection. It will
// be reset whatever a new byte has been read. If timeout, the image
// pulling will be cancelled. A zero value means there is no timeout.
I think that using this setting with a short duration like 10s could help with this issue.
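For illustration only, here is a rough sketch of where this could live in containerd's config.toml if Bottlerocket exposed it; the plugin section name is my assumption based on containerd 2.x's split CRI image-service config, and the value just mirrors the 10s suggested above:
[plugins."io.containerd.cri.v1.images"]
  # assumed section; maximum time with no image data read from the registry on an
  # open connection before the pull is cancelled ("0" would mean no timeout)
  image_pull_progress_timeout = "10s"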
How to reproduce the problem:
I was unable to reproduce the problem in our development environment; the worker nodes in the affected environment were downgraded to the Kubernetes 1.33 variant.
Hi @rubroboletus, thanks for cutting this issue.
We are tracking the new registry API config_path in this issue. It will provide configurable timeouts such as dial_timeout, which may help your use case.
I have also opened an issue to track the image_pull_progress_timeout as a feature request - https://github.com/bottlerocket-os/bottlerocket/issues/4679
Comparing the 1.34 and 1.33 variants, the most obvious difference apart from the Kubernetes versions is the containerd version (v2.1 vs v2.0), which is something to be mindful of. To help us reproduce this, do you have a minimal reproducible example that consistently triggers the issue?
Hi @ytsssun,
unfortunately this was happening in our testing environment, which is heavily used, so I downgraded the worker nodes to 1.33. I was unable to reproduce it in our development environment.
Hi @rubroboletus. Could you share some details about the image registry you are pulling against? Is it a custom image registry and do you use any pull through cache?
A previous observation in https://github.com/bottlerocket-os/bottlerocket/issues/4564 was that slow image registries make it seem like the image pull is stalled but it is still happening in the background. Would you be able to try and validate this idea in your test environment?
Hi @vigh-m, we are using self-hosted JFrog Artifactory 7.117.18. The images being pulled and reported in this issue do not go through a pull-through cache; they are built locally and hosted directly in our Artifactory Docker registry. Artifactory runs in HA mode with 3 nodes.
Yesterday I reproduced the issue, though without exact steps that can be followed manually. We undeploy all client workloads overnight and on weekends and deploy them again every morning (helm get manifest APP | kubectl delete -) / (helm get manifest APP | kubectl create -). Yesterday I again upgraded our worker nodes to the latest Bottlerocket (Bottlerocket OS 1.50.0 (aws-k8s-1.34)) and recreated all client workloads. On two worker nodes, 3 pods ended up in ImagePullBackOff. This is part of kubectl describe:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16m default-scheduler 0/25 nodes are available: 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/25 nodes are available: 17 No preemption victims found for incoming pod, 8 Preemption is not helpful for scheduling.
Warning FailedScheduling 15m (x5 over 16m) default-scheduler 0/25 nodes are available: 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/25 nodes are available: 17 No preemption victims found for incoming pod, 8 Preemption is not helpful for scheduling.
Warning FailedScheduling 15m (x4 over 15m) default-scheduler 0/26 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/26 nodes are available: 17 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.
Warning FailedScheduling 15m default-scheduler 0/26 nodes are available: 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 9 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/26 nodes are available: 17 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.
Normal Scheduled 15m default-scheduler Successfully assigned optimus-tst-88/optimus-enterpriseit-get-loom-7ddf4c9f5f-mt7cp to ip-10-66-16-12.eu-central-1.compute.internal
Normal SecurityGroupRequested 15m vpc-resource-controller Pod will get the following Security Groups [sg-07ef6aac3262e1556 sg-052b2c7cee390115d]
Normal ResourceAllocated 15m vpc-resource-controller Allocated [{"eniId":"eni-0689f9a5d730cd2d5","ifAddress":"02:6d:98:fa:41:87","privateIp":"10.66.18.224","ipv6Addr":"","vlanId":21,"subnetCidr":"10.66.16.0/22","subnetV6Cidr":"","associationID":"trunk-assoc-7db9d236"}] to the pod
Normal Pulled 15m kubelet Container image "docker-master.docker.moneta-containers.net:8443/moneta.redhat-ubi9-minimal:latest" already present on machine
Normal Created 15m kubelet Created container: pod-init
Normal Started 15m kubelet Started container pod-init
Warning Failed 15m kubelet Failed to pull image "docker-test.docker.moneta-containers.net:8443/moneta/optimus/enterpriseit-get-loom:2.25.3-SNAPSHOT": pull QPS exceeded
Warning Failed 15m kubelet Error: ErrImagePull
Normal BackOff 15m kubelet Back-off pulling image "docker-test.docker.moneta-containers.net:8443/moneta/optimus/enterpriseit-get-loom:2.25.3-SNAPSHOT"
Warning Failed 15m kubelet Error: ImagePullBackOff
Normal Pulling 14m (x2 over 15m) kubelet Pulling image "docker-test.docker.moneta-containers.net:8443/moneta/optimus/enterpriseit-get-loom:2.25.3-SNAPSHOT"
So the first pull exceeded the QPS limit, which is fine, but the subsequent ones failed too.
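If needed, we could presumably also raise kubelet's pull rate limit through Bottlerocket's Kubernetes settings; the setting names below are my assumption from the Bottlerocket settings reference and the values are only examples (kubelet's defaults are QPS 5 and burst 10):
[settings.kubernetes]
# assumed names mapping to kubelet's registryPullQPS / registryBurst
registry-qps = 10
registry-burst = 20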
We see the same thing with our own self-hosted Harbor registry, so I wonder if the self-hosting part is relevant. On our 1.34 cluster, I downgraded to the k8s 1.33-1.40 AMI (the version we used before upgrading to 1.34) and the problem went away. I will now try the 1.33-1.50 AMI to see whether it is related to the Bottlerocket 1.50 version or to the k8s 1.34 part.
It is definitely related to the Bottlerocket 1.49.0 / 1.50.0 1.34 variant: it ships containerd 2.1, while the 1.33 variant ships containerd 2.0, and they pull images in different ways.
Hi @rubroboletus ! Thanks for sharing the details.
If you have premium support, would you be able to engage AWS support so that we can get containerd journal logs? If not, are you comfortable sharing them here?
Thanks!
containerd 2.1 updated CRI to use the transfer service for image pulls by default. In addition, starting with Bottlerocket 1.47.0, when we added the k8s-1.34 variants (which brought in containerd 2.1), we made use of concurrent-download-chunk-size (also aliased as concurrent-layer-fetch-buffer), which defaults to 8 MiB; more on the containerd setting here.
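For reference, on the node this Bottlerocket setting should end up rendered into containerd's config.toml roughly like the sketch below; the plugin path and key name are assumptions based on containerd 2.1's transfer service, so verify against /etc/containerd/config.toml on a node:
[plugins."io.containerd.transfer.v1.local"]
  # assumed key behind concurrent-download-chunk-size / concurrent-layer-fetch-buffer;
  # 8 MiB (8388608 bytes) by default, and 0 disables concurrent layer fetching
  concurrent_layer_fetch_buffer = 8388608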
@rubroboletus This parallelized fetching of concurrent layers could be what triggers the QPS limit. Logs would still be appreciated, as @vigh-m mentioned. In addition, could you try testing with the following setting, which disables concurrent layer pulls:
apiclient set settings.container-runtime.concurrent-download-chunk-size=0
Or via user-data:
[settings.container-runtime]
concurrent-download-chunk-size=0
I'd be curious whether this helps with your QPS issue.
Hi @vigh-m !
We already have a case open with premium support - Case ID 176189972400683. I hope all the logs are there; if not, we can collect them and share them with you.
Hello @KCSesh,
thank you, we can test it over the weekend, because this is happening in our testing environment and it is heavily utilized on working days.
After 24 hours of running with settings.container-runtime.concurrent-download-chunk-size=0 on 1.34 and v1.50, it seems stable. Without that setting, stuck/failing containers appeared after a few hours. I will monitor for a few more days before pushing this to production.
I did not get the QPS error, just general errors pulling the images.
I can confirm that the setting:
[settings.container-runtime]
concurrent-download-chunk-size=0
most likely solved our problem. I was unable to try this setting over the weekend, so I tested it just now on the latest Bottlerocket image - Bottlerocket OS 1.50.0 (aws-k8s-1.34).
@rubroboletus we have heard that updating the JFrog Artifactory version also mitigates the issue by allowing Range GET requests. We are primarily tracking this here: https://github.com/bottlerocket-os/bottlerocket/issues/4709
@KCSesh thank you, but we follow the JFrog Artifactory release cycle, so we are always on one of the latest releases - currently 7.125.6.