
Image pull problems on Bottlerocket OS 1.49.0 (aws-k8s-1.34)

Open rubroboletus opened this issue 1 month ago • 14 comments

Image I'm using:

Bottlerocket OS 1.49.0 (aws-k8s-1.34)

What I expected to happen:

POD start

What actually happened:

When we upgraded our cluster to EKS 1.34 and Bottlerocket OS 1.49.0 (aws-k8s-1.34), the container image is sometimes not pulled and container / pod creation gets stuck. In "sheltie", I ran "ctr -n k8s.io image pull" on the problematic image and saw that the pull of one image layer was stuck in the middle; the progress bar was not moving. Containerd has a setting, image_pull_progress_timeout, that is not used in Bottlerocket and cannot be set by the user. This setting has the following effect (from the containerd sources):

// ImagePullProgressTimeout is the maximum duration that there is no
// image data read from image registry in the open connection. It will
// be reset whatever a new byte has been read. If timeout, the image
// pulling will be cancelled. A zero value means there is no timeout.

I think that using this setting with a short time period like 10s could help with this issue.
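
For context, a rough sketch of what this option looks like in a containerd config file. This is illustrative only, since Bottlerocket does not currently expose it, and the plugin section name is an assumption based on containerd 2.x (in 1.x the CRI settings live under "io.containerd.grpc.v1.cri"):

# Sketch only: not settable through Bottlerocket's API today.
# Section name assumes the containerd 2.x CRI image config.
[plugins."io.containerd.cri.v1.images"]
  # Cancel a pull when no new bytes arrive from the registry for this long.
  image_pull_progress_timeout = "10s"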

How to reproduce the problem:

I was unable to simulate this problem in our development environment; the affected environment's worker nodes were downgraded to the Kubernetes 1.33 variants.

rubroboletus avatar Oct 31 '25 08:10 rubroboletus

Hi @rubroboletus , thanks for cutting us this issue.

We are tracking the new registry API config_path in this issue. It would provide configurable timeouts like dial_timeout, which may help your use case.
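
For illustration, config_path points containerd at per-registry hosts.toml files; a minimal sketch of that shape, using the registry host that appears later in this thread (any timeout-related fields would come on top of this):

# /etc/containerd/certs.d/docker-test.docker.moneta-containers.net:8443/hosts.toml
server = "https://docker-test.docker.moneta-containers.net:8443"

[host."https://docker-test.docker.moneta-containers.net:8443"]
  capabilities = ["pull", "resolve"]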

I have also opened an issue to track the image_pull_progress_timeout as a feature request - https://github.com/bottlerocket-os/bottlerocket/issues/4679

Comparing the 1.34 and 1.33 variants, the most obvious difference apart from the Kubernetes versions is the containerd version (v2.1 vs. v2.0), which is something to be mindful of. To help us reproduce this, do you have a minimal reproducible example that consistently triggers the issue?

ytsssun avatar Nov 03 '25 23:11 ytsssun

Hi @ytsssun,

unfortunately this was happening in our testing environment, which is heavily used, and I have downgraded its worker nodes to 1.33. When I tried to reproduce this in our development environment, I did not succeed.

rubroboletus avatar Nov 04 '25 07:11 rubroboletus

Hi @rubroboletus. Could you share some details about the image registry you are pulling against? Is it a custom image registry, and do you use any pull-through cache?

A previous observation in https://github.com/bottlerocket-os/bottlerocket/issues/4564 was that slow image registries can make it seem like the image pull is stalled while it is actually still progressing in the background. Would you be able to try to validate this idea in your test environment?
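
One way to check this from an affected node, assuming access to the admin container and sheltie as used earlier in this thread (a sketch; the exact ctr output columns vary by containerd version):

# Enter the host root shell from the admin container
sudo sheltie

# List active content ingests; run it a few times and compare the reported sizes.
# If the downloaded byte counts keep growing, the pull is slow rather than stalled.
ctr --namespace k8s.io content active

# The in-progress layer data also lives on disk; growing files point the same way.
ls -l /var/lib/containerd/io.containerd.content.v1.content/ingest/*/data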

vigh-m avatar Nov 07 '25 23:11 vigh-m

Hi @vigh-m, we are using self-hosted JFrog Artifactory 7.117.18. The images being pulled and reported in this issue are not pulled through a cache; they are built locally and hosted directly in our AF Docker registry. AF is in HA mode with 3 nodes.

Yesterday I reproduced the issue, but without exact steps that can be followed manually. We undeploy all client workloads during the night / weekend and deploy them again every morning (helm get manifest APP | kubectl delete -) / (helm get manifest APP | kubectl create -). Yesterday I again upgraded our worker nodes to the latest Bottlerocket (Bottlerocket OS 1.50.0 (aws-k8s-1.34)) and recreated all client workloads. On two worker nodes there were 3 pods in ImagePullBackOff state. This is part of the kubectl describe output:

Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Warning  FailedScheduling        16m                default-scheduler        0/25 nodes are available: 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/25 nodes are available: 17 No preemption victims found for incoming pod, 8 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        15m (x5 over 16m)  default-scheduler        0/25 nodes are available: 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/25 nodes are available: 17 No preemption victims found for incoming pod, 8 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        15m (x4 over 15m)  default-scheduler        0/26 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/26 nodes are available: 17 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        15m                default-scheduler        0/26 nodes are available: 2 node(s) had untolerated taint {workload.mbid.cz/dedicated: jenkins}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: maparm64}, 3 node(s) had untolerated taint {workload.mbid.cz/dedicated: map}, 4 Insufficient memory, 7 Insufficient cpu, 9 Insufficient vpc.amazonaws.com/pod-eni. no new claims to deallocate, preemption: 0/26 nodes are available: 17 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.
  Normal   Scheduled               15m                default-scheduler        Successfully assigned optimus-tst-88/optimus-enterpriseit-get-loom-7ddf4c9f5f-mt7cp to ip-10-66-16-12.eu-central-1.compute.internal
  Normal   SecurityGroupRequested  15m                vpc-resource-controller  Pod will get the following Security Groups [sg-07ef6aac3262e1556 sg-052b2c7cee390115d]
  Normal   ResourceAllocated       15m                vpc-resource-controller  Allocated [{"eniId":"eni-0689f9a5d730cd2d5","ifAddress":"02:6d:98:fa:41:87","privateIp":"10.66.18.224","ipv6Addr":"","vlanId":21,"subnetCidr":"10.66.16.0/22","subnetV6Cidr":"","associationID":"trunk-assoc-7db9d236"}] to the pod
  Normal   Pulled                  15m                kubelet                  Container image "docker-master.docker.moneta-containers.net:8443/moneta.redhat-ubi9-minimal:latest" already present on machine
  Normal   Created                 15m                kubelet                  Created container: pod-init
  Normal   Started                 15m                kubelet                  Started container pod-init
  Warning  Failed                  15m                kubelet                  Failed to pull image "docker-test.docker.moneta-containers.net:8443/moneta/optimus/enterpriseit-get-loom:2.25.3-SNAPSHOT": pull QPS exceeded
  Warning  Failed                  15m                kubelet                  Error: ErrImagePull
  Normal   BackOff                 15m                kubelet                  Back-off pulling image "docker-test.docker.moneta-containers.net:8443/moneta/optimus/enterpriseit-get-loom:2.25.3-SNAPSHOT"
  Warning  Failed                  15m                kubelet                  Error: ImagePullBackOff
  Normal   Pulling                 14m (x2 over 15m)  kubelet                  Pulling image "docker-test.docker.moneta-containers.net:8443/moneta/optimus/enterpriseit-get-loom:2.25.3-SNAPSHOT"

so the first pull exceeded QPS, which is OK, but the subsequent ones failed.
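
As an aside, the "pull QPS exceeded" message comes from the kubelet's client-side image pull rate limit, not from the registry. A hedged user-data sketch for raising it, assuming Bottlerocket's settings.kubernetes.registry-qps / registry-burst-rate settings (which map to the kubelet's registryPullQPS / registryBurst, upstream defaults 5 / 10); this would only quiet the rate-limit error, not fix a stalled pull:

# Assumed setting names; verify against the Bottlerocket settings reference.
[settings.kubernetes]
registry-qps = 10
registry-burst-rate = 20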

rubroboletus avatar Nov 10 '25 06:11 rubroboletus

We have the same when using our own self-hosted Harbor-registry, I wonder if the self-hosting part is relevant. On our 1.34 cluster, I downgraded to the k8s 1.33-1.40 AMI (which was the version we used before upgrading to 1.34) and the problem went away. I will now try with the 1.33-1.50 AMI, to see if it is related to the 1.50 version or the k8s 1.34 part.

vincentjanv avatar Nov 10 '25 13:11 vincentjanv

> We have the same when using our own self-hosted Harbor-registry, I wonder if the self-hosting part is relevant. On our 1.34 cluster, I downgraded to the k8s 1.33-1.40 AMI (which was the version we used before upgrading to 1.34) and the problem went away. I will now try with the 1.33-1.50 AMI, to see if it is related to the 1.50 version or the k8s 1.34 part.

it is for sure related to the Bottlerocket 1.49.0 / 1.50.0 1.34 variants: they ship containerd 2.1, while the 1.33 variants ship containerd 2.0, and the two pull images in different ways.

rubroboletus avatar Nov 10 '25 13:11 rubroboletus

Hi @rubroboletus ! Thanks for sharing the details.

If you have premium support, would you be able to engage AWS support so that we can get containerd journal logs? If not, are you comfortable sharing them here?

Thanks!

vigh-m avatar Nov 10 '25 22:11 vigh-m

containerd-2.1 updated CRI to use the transfer service for image pulls by default. In addition, starting with Bottlerocket 1.47.0, when we added the k8s-1.34 variants (which brought in containerd-2.1), we started making use of concurrent-download-chunk-size (also aliased as concurrent-layer-fetch-buffer), which is set to 8 MiB by default; more on the containerd setting here.

@rubroboletus This parallelization of concurrent layer fetches could be triggering QPS. Logs would still be appreciated, as @vigh-m mentioned. In addition, could you test with the setting below, which disables the concurrent layer pull:

apiclient set settings.container-runtime.concurrent-download-chunk-size=0

Or via user-data:

[settings.container-runtime]
concurrent-download-chunk-size=0

I would be curious whether this helps with your QPS issue.

KCSesh avatar Nov 10 '25 23:11 KCSesh

Hi @vigh-m !

We already have a ticket with premium support - Case ID 176189972400683. I hope all the logs are there. If not, we can collect them and share them with you.

rubroboletus avatar Nov 11 '25 05:11 rubroboletus

Hello @KCSesh,

thank you, we can test it over the weekend, because this is happening in our testing environment and it is heavily utilized on working days.

rubroboletus avatar Nov 11 '25 06:11 rubroboletus

After 24 hours with settings.container-runtime.concurrent-download-chunk-size=0 applied on 1.34 and v1.50, it seems stable. Without that setting, stuck/failing containers appeared after a few hours. I will monitor for a few more days before pushing it to production.

I did not get the QPS error, just general errors pulling the images.

vincentjanv avatar Nov 13 '25 11:11 vincentjanv

I can confirm that the setting:

[settings.container-runtime]
concurrent-download-chunk-size=0

probably solved our problem. I was unable to try this setting during the weekend; I tested it just now on the latest Bottlerocket images - Bottlerocket OS 1.50.0 (aws-k8s-1.34).

rubroboletus avatar Nov 18 '25 21:11 rubroboletus

@rubroboletus we have heard that updating the JFrog Artifactory version also mitigates the issue by allowing Range GET requests. We are primarily tracking this here: https://github.com/bottlerocket-os/bottlerocket/issues/4709
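
For anyone who wants to verify whether their registry honors ranged blob fetches, a rough curl sketch (registry host and repository borrowed from earlier in this thread; the digest and token are placeholders):

# Placeholder values: substitute a real layer digest and, if needed, a bearer token.
TOKEN="<registry-bearer-token>"
DIGEST="sha256:<layer-digest>"

# A registry that supports ranged blob GETs should answer 206 Partial Content;
# a 200 with the full blob means the Range header is being ignored.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $TOKEN" \
  -H "Range: bytes=0-1023" \
  "https://docker-test.docker.moneta-containers.net:8443/v2/moneta/optimus/enterpriseit-get-loom/blobs/$DIGEST"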

KCSesh avatar Nov 24 '25 21:11 KCSesh

@KCSesh thank you, but we follow the JFrog Artifactory release cycle, so we are always on one of the latest releases - currently 7.125.6.

rubroboletus avatar Nov 25 '25 05:11 rubroboletus