
Bottlerocket Image Pull Issues on aws-k8s-1.34* and aws-k8s-1.33* variants.

Open KCSesh opened this issue 1 month ago • 24 comments

Our latest release, Bottlerocket 1.51.0, moves aws-k8s-1.33* variants to containerd-2.1 due to the EOL of containerd-2.0; aws-k8s-1.34* variants have been on containerd-2.1 since launch.

containerd-2.1 defaults to a new image pull flow called the transfer service. On top of that, Bottlerocket makes use of concurrent-download-chunk-size (also aliased as concurrent-layer-fetch-buffer), which is set to 8 MiB by default. More details on this containerd setting can be found here.

If you are experiencing image pull problems, they are likely related to containerd's new parallelized image pull flow.

If you are able to reproduce the issue in your environment, gathering additional debug logs from the containerd service would be helpful so we can forward them to the containerd team for investigation. To enable debug logs on containerd, follow the steps in the comment below.

Additionally, if you do not want to share logs in this ticket and have access to AWS Support, feel free to go through them instead.


Mitigation

To mitigate image pull problems, a possible solution is to disable image pull parallelization. You can do this as follows:

apiclient set settings.container-runtime.concurrent-download-chunk-size=0

Or via user-data:

[settings.container-runtime]
concurrent-download-chunk-size=0
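
To double-check that the mitigation took effect on a node, here is a rough sketch from the control or admin container (the grep is only a heuristic, since the rendered key name in containerd's config differs from the Bottlerocket setting name):

# Confirm the setting is persisted in the Bottlerocket API:
apiclient get settings.container-runtime

# From the admin container, optionally inspect the rendered containerd config:
grep -i concurrent /.bottlerocket/rootfs/etc/containerd/config.toml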

Known Issues:

  • Artifactory/JFrog image pulls have a few reported issues.

    • All Artifactory issues have been resolved with the mitigation.
    • We have also heard reports that updating Artifactory/JFrog to the latest version fixes the issues!
  • One possible race condition reported; we are attempting to find a reproduction.

  • pull QPS exceeded

    • So far, an example of hitting this issue appears to be related to pulling from Docker Hub unauthenticated and hitting their throttling limits faster than normal. Disabling parallelization per the mitigation above should help.

Related tickets:

  • https://github.com/bottlerocket-os/bottlerocket/issues/4677
  • https://github.com/bottlerocket-os/bottlerocket/issues/4707

KCSesh avatar Nov 18 '25 19:11 KCSesh

Optionally Enable Containerd Debug logs for log gathering

For clarity again: These steps are not recommended for general use and are specifically for single-node debugging purposes only. These changes are temporary and will be overwritten in many scenarios. If you are willing to do the following, thank you!

Temporarily Modifying Containerd Configuration for Debug Logs

From the admin container:

  1. Modify the containerd config:

cd /.bottlerocket/rootfs/etc/containerd/
cp config.toml config.toml.backup
vi config.toml

  2. As described by containerd, add the following to config.toml:

[debug]
level = "debug"

  3. Save the changes with :wq

  4. From the admin container, enter the host namespace:

sheltie

  5. Restart the service:

systemctl restart containerd

  6. View the logs and verify debug output:

journalctl -u containerd -f

  7. Debug.

  8. Restore: either restore the backup and restart containerd, delete the changes and restart containerd, or simply restart the node.

A consolidated sketch of the same flow follows below.
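
For convenience, the same flow condensed into a rough sketch (not an official procedure): the heredoc appends the [debug] table described above, and note that sheltie opens a new root shell in the host namespace, so the commands after it run inside that shell. Depending on how you entered the admin container, you may need sudo for the file edits.

cd /.bottlerocket/rootfs/etc/containerd/
cp config.toml config.toml.backup
cat >> config.toml <<'EOF'

[debug]
level = "debug"
EOF

sheltie                        # root shell in the host namespace
systemctl restart containerd   # pick up the new log level
journalctl -u containerd -f    # confirm level=debug entries appear

# To restore: copy config.toml.backup back over config.toml and restart
# containerd, or simply reboot the node.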

KCSesh avatar Nov 18 '25 19:11 KCSesh

Hi @KCSesh !

We're experiencing the same problem after upgrading our EKS clusters to Bottlerocket AMI v1.51.0 (the 1.33 variant) with containerd 2.1.

Observed behavior

When the image has Docker MediaType, the pull succeeds. When the image uses OCI MediaType, the pull fails.

Questions:

  • Is this the expected behavior for containerd 2.1 in Bottlerocket v1.51.0?
  • Are there any official AWS recommendations for mitigating this impact?
  • Do you know if other customers are facing similar issues after upgrading to this AMI?

We'd appreciate any guidance or best practices to handle this scenario. Happy to provide more logs or details if needed.

Thanks for your support!

rdglinux avatar Nov 19 '25 14:11 rdglinux

Hey @rdglinux!

When the image has Docker MediaType, the pull succeeds. When the image uses OCI MediaType, the pull fails.

  1. No, I wouldn't consider that expected. If you could share logs or a reproducer, that would be great! We can follow up with containerd if we confirm it is an issue on their side.

2 & 3: So far we are monitoring the 1.51.0 release rollout and tracking the associated mitigations and issues found there!

KCSesh avatar Nov 19 '25 17:11 KCSesh

I believe this has to do with the new Docker version 29 that was released last week. I was fighting some issues with my builds and found that we were unable to deploy Docker images to Lambda. We rolled back all our builders to the latest version of Docker Engine v28 and were able to build again.

Currently I'm dealing with an Artifactory instance (v7.90.6) in our EKS cluster that acts as a pull-through cache for our images. We are unable to pull images from Artifactory on any nodes running Bottlerocket version 1.51.0 with EKS version v1.33.5-eks-ba24e9c. I noticed that this version is the first one to use v29 of the Docker Engine. I believe that could be contributing to the issue.

gerald-pinder-omnicell avatar Nov 20 '25 15:11 gerald-pinder-omnicell

@gerald-pinder-omnicell our EKS variants don't ship with Docker; that would only be found on our ECS variants.

If you have any logs, errors or a reproduction of the issue those would be great to see!

KCSesh avatar Nov 20 '25 16:11 KCSesh

our EKS variants don't ship with Docker; that would only be found on our ECS variants.

Sorry, I saw the release that mentioned v29. Didn't realize that was for ECS.

If you have any logs, errors or a reproduction of the issue those would be great to see!

Unfortunately, we had to quickly revert to 1.50.0 to keep production traffic up, so I didn't get the chance to grab logs. Reverting back to the previous version did help though.

gerald-pinder-omnicell avatar Nov 20 '25 16:11 gerald-pinder-omnicell

I did see this event though (url changed for security)

ImagePullBackOff (Back-off pulling image "artifactory.example.com/dockerhub/multiarch/qemu-user-static": ErrImagePull: failed to pull and unpack image "artifactory.example.com/dockerhub/multiarch/qemu-user-static:latest": failed to copy: httpReadSeeker: failed open: unexpected status from GET request to https://artifactory.example.com/v2/dockerhub/multiarch/qemu-user-static/manifests/sha256:fe60359c92e86a43cc87b3d906006245f77bfc0565676b80004cc666e4feb9f0: 400 Bad Request)

This is pulling from Artifactory (v7.90.6). My team is going to have to take steps to slowly roll out the new Bottlerocket version.

gerald-pinder-omnicell avatar Nov 20 '25 16:11 gerald-pinder-omnicell

We are also using Artifactory 7.111.10 and ran into the ImagePullBackOff issues. We had to pin our EC2NodeClass in Karpenter in our EKS 1.33 clusters to v1.50.0 to prevent nodes from coming up that would never be usable.

dshackith avatar Nov 20 '25 16:11 dshackith

@dshackith Are you also getting 400 Bad Request on ImagePullBackOff?

KCSesh avatar Nov 20 '25 17:11 KCSesh

While we were going through our Prod clusters, we noticed that images our company built and stored in Artifactory were able to be pulled down onto nodes running 1.51.0. However, images going through the pull-through cache (to hub.docker.com) didn't work. We still continued with the rollback to keep things stable though.

gerald-pinder-omnicell avatar Nov 20 '25 17:11 gerald-pinder-omnicell

Interesting @gerald-pinder-omnicell Thanks for the additional info.

I found https://github.com/containerd/containerd/issues/11953, which seems semi-related, maybe in line with:

images our company built and stored in Artifactory were able to be pulled down onto nodes

But it sounds like there may still be an issue with the pull-through cache.

images going through the pull-through cache (to hub.docker.com) didn't work.

I assume your Bottlerocket nodes have something like this then:

[settings.container-registry.mirrors]
registry = "docker.io"
endpoint = ["https://<artifactory-host>/artifactory/<repository-key-name>"]

Update: Removed double bracket on settings.container-registry.mirrors 👍

KCSesh avatar Nov 20 '25 17:11 KCSesh

Interesting @gerald-pinder-omnicell Thanks for the additional info.

I found containerd/containerd#11953, which seems semi-related, maybe in line with:

images our company built and stored in Artifactory were able to be pulled down onto nodes

But it sounds like there may still be an issue with the pull-through cache.

images going through the pull-through cache (to hub.docker.com) didn't work.

I assume your Bottlerocket nodes have something like this then:

[[settings.container-registry.mirrors]]
registry = "docker.io"
endpoint = ["https://<artifactory-host>/artifactory/<repository-key-name>"]

I totally didn't know about this functionality, thanks for clueing me in on this! No, we were just manually setting images in our helm charts to use our pull-through cache instead. This will definitely be something we set up in the near future.

gerald-pinder-omnicell avatar Nov 20 '25 18:11 gerald-pinder-omnicell

@KCSesh On the nodes, we were seeing messages like the following:

Failed to pull image "artifactory.example.org/eks/amazon-k8s-cni-init:v1.20.0": rpc error: code = NotFound desc = failed to pull and unpack image "artifactory.example.org/eks/amazon-k8s-cni-init:v1.20.0": failed to copy: httpReadSeeker: failed open: content at https://artifactory.example.org/v2/eks/amazon-k8s-cni-init/manifests/sha256:04741e35763093a3b19135fb482463dec94209696c665862b2ffad8017e7c8f9 not found: not found

dshackith avatar Nov 20 '25 18:11 dshackith

@dshackith That also seems to be set up as a pull-through cache, is that right?

KCSesh avatar Nov 20 '25 18:11 KCSesh

@KCSesh Yes, exactly. All our manifests reference the Artifactory instance and we heavily use the pull-through cache.

dshackith avatar Nov 20 '25 18:11 dshackith

I have set up Artifactory and run some experiments with pull-through cache on both Docker Hub and Public ECR, and haven't been able to reproduce, all using the default 8 MiB setup. I'll try and keep poking at this.

I am curious whether @dshackith or @gerald-pinder-omnicell have tried setting:

settings.container-runtime.concurrent-download-chunk-size=0

And re-tested?

KCSesh avatar Nov 21 '25 23:11 KCSesh

before settings.container-runtime.concurrent-download-chunk-size=0:

Successfully pulled image "docker.artifactory.example.com/foo/config-sidecar:v0.3.0" ... Image size: 28144008 bytes.

Failed to pull image "docker.artifactory.example.com/bar/sth:v1.0.0": rpc error: code = NotFound desc = failed to pull and unpack image "docker.artifactory.example.com/bar/sth:v1.0.0": failed to copy: httpReadSeeker: failed open: content at https://docker.artifactory.example.com/v2/bar/sth/manifests/sha256:1234abcd.. not found: not found

after settings.container-runtime.concurrent-download-chunk-size=0:

Successfully pulled image "docker.artifactory.example.com/foo/config-sidecar:v0.3.0" ... Image size: 28144008 bytes.

Successfully pulled image "docker.artifactory.example.com/bar/sth:v1.0.0" ... Image size: 121159593 bytes.

@KCSesh, do you recommend the settings.container-runtime.concurrent-download-chunk-size=0 workaround while the issue is being investigated? And can it be set programmatically/using IaC?

angabini avatar Nov 22 '25 21:11 angabini

@angabini thanks for providing some of the info. It seems to match the behavior the others might be describing, where some pulls from Artifactory are working and others (specifically pull-through cache) are not.

So far I have primarily only heard of issues on Artifactory, and am trying to gather more logs and create a repro! The current understanding is that the Artifactory server might not be handling the client requests from containerd properly, but more logs and a repro will help confirm this hypothesis.

do you recommend the settings.container-runtime.concurrent-download-chunk-size=0 workaround while the issue is being investigated?

I would recommend setting the value as the investigation continues. To restate: this setting enables parallel layer fetch, which helps optimize image pull times. Setting it to 0 simply removes this optimization.

and can it be set programmatically/using IaC?

Yes, Bottlerocket settings can be set programmatically through IaC; this is usually done through user-data. Here are a few popular links:
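
As a rough illustration only (a sketch, not an official recipe; the file name is a placeholder), the mitigation can be baked into node user-data and attached through whichever tool provisions your nodes (EC2 launch template, Karpenter EC2NodeClass userData, managed node groups, etc.):

# Write the mitigation as Bottlerocket user-data TOML:
cat > bottlerocket-user-data.toml <<'EOF'
[settings.container-runtime]
concurrent-download-chunk-size = 0
EOF

# Raw EC2 launch templates expect user data base64-encoded
# (GNU coreutils flags shown):
base64 -w0 bottlerocket-user-data.toml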

KCSesh avatar Nov 24 '25 18:11 KCSesh

@dshackith @gerald-pinder-omnicell @angabini it seems an update to the JFrog Artifactory version (a server-side bug fix) may also mitigate the issue by allowing Range GET requests, which explains why I haven't been able to reproduce.

If that is an option for any of you, I would be interested in hearing results!

KCSesh avatar Nov 24 '25 21:11 KCSesh

Ok, so we're pretty sure we're running into this issue, or some related one, as well. It manifests as pods staying in ContainerCreating for hours with their last event being Pulling image "...". They never throw an error, interestingly enough.

We have observed this for 2 different registries: AWS ECR with images hosted in that registry directly (no pull through) and docker.io either directly or via a Harbor pull-through cache.

These pulls do not finish. I have some that have been going on for hours. As soon as I force-delete the pod and it gets re-scheduled on a different node, the pull works fine. If it re-schedules on the same node, the pulls do not work. I have AWS SSM access to the nodes and tried to collect debug logs; unfortunately, the steps to enable debug logs involve restarting containerd, which apparently fixes whatever the problem was. Things get unstuck, the images are pulled, and the pods start.

I have so far not found a way to provoke/reproduce this behaviour. The only thing everything has in common so far is that the affected nodes are running containerd 2.1.5+bottlerocket, Kubelet v1.33.5-eks-ba24e9c, Bottlerocket OS 1.51.0 (aws-k8s-1.33).

I have some non-debug logs from a node that I collected earlier, and I'm happy to share a bigger version of this in some different manner. We'll try the settings.container-runtime.concurrent-download-chunk-size=0 proposal now in hopes of mitigating this issue, as it currently requires manual intervention to unstick the pods/nodes, which is not great.

Some logs from journal

Nov 27 09:01:00 node-01-prod containerd[1924]: time="2025-11-27T09:01:00.758792420Z" level=info msg="PullImage \"our.ecr.us-east-1.amazonaws.com/repository:image-tag-01@sha256:digest-01\""
Nov 27 09:01:00 node-01-prod containerd[1924]: time="2025-11-27T09:01:00.973163627Z" level=info msg="remote host ignored content range, forcing parallelism to 1" digest="sha256:digest-01" error="content range requests ignored" mediatype=application/vnd.oci.image.index.v1+json size=1609
Nov 27 10:04:02 node-01-prod containerd[1924]: time="2025-11-27T10:04:02.349886969Z" level=info msg="stop pulling image our.ecr.us-east-1.amazonaws.com/repository@sha256:digest-01: active requests=0, bytes read=39112"

Nov 27 08:39:00 node-01-prod containerd[1924]: time="2025-11-27T08:39:00.775581711Z" level=info msg="PullImage \"our.ecr.us-east-1.amazonaws.com/repository:image-tag-02@sha256:digest-02\""
Nov 27 08:39:00 node-01-prod containerd[1924]: time="2025-11-27T08:39:00.976264704Z" level=info msg="remote host ignored content range, forcing parallelism to 1" digest="sha256:digest-02" error="content range requests ignored" mediatype=application/vnd.oci.image.index.v1+json size=1609
Nov 27 10:04:02 node-01-prod containerd[1924]: time="2025-11-27T10:04:02.349860889Z" level=info msg="stop pulling image our.ecr.us-east-1.amazonaws.com/repository@sha256:digest-02: active requests=0, bytes read=39115"

10:04 was the time I applied the debug logging settings and had to restart containerd.

langesven avatar Nov 27 '25 11:11 langesven

@langesven I'll be curious to hear updates if settings.container-runtime.concurrent-download-chunk-size=0 helps.

As soon as I force-delete the pod and get it re-scheduled on a different node the pull works fine.

Regarding the new node it lands on, is it also running Bottlerocket 1.51.0 or a different version/os?

Also, once containerd has been restarted, do other pod deployments seem to function normally?

KCSesh avatar Dec 02 '25 00:12 KCSesh

Heya, sorry, I did mean to get back to this sooner actually 😅

settings.container-runtime.concurrent-download-chunk-size=0 did help, we have not observed this issue anymore since configuring this!

Regarding the new node it lands on, is it also running Bottlerocket 1.51.0 or a different version/os?

Exact same config as the non-working node. It's the same everything. Just works.

Also, once containerd has been restarted, do other pod deployments seem to function normally?

Yeah everything does. Even the things that had the pull errors just started working as if nothing was ever wrong...

Some things that I did notice while working on this: we only ever saw the image pull errors for the same sort of images. They are quite big (~8-10 GB). This also appeared to happen mostly in "clusters", i.e. there'd be 3 pods stuck on the same node, rather than 3 pods stuck on 3 different nodes. I'm wondering if maybe the size of the image + chunking + 3 pulls in parallel for big images are sort of responsible here?

I wanted to get you some debug logs, but I've failed to figure out how to make Karpenter spawn the Bottlerocket nodes with containerd debug logging enabled. Is there a way to do this? I cannot actively provoke this error, but I'm sure if I let one node pool run with the previous config it will happen sooner or later 😄

I did notice when I restarted containerd on a node that had this issue it logged that it stopped 8 pull processes for 5 different images. At the time of the restart I only saw 3 or 4 pods having this issue with 2 different images.

level=info msg="stop pulling image image1: active requests=0, bytes read=39117"
level=info msg="stop pulling image image2: active requests=0, bytes read=39115"
level=info msg="stop pulling image image3: active requests=0, bytes read=39115"
level=info msg="stop pulling image image1: active requests=0, bytes read=39117"
level=info msg="stop pulling image image4: active requests=0, bytes read=39112"
level=info msg="stop pulling image image1: active requests=0, bytes read=2635105536"
level=info msg="stop pulling image image1: active requests=0, bytes read=1735361159"
level=info msg="stop pulling image image5: active requests=0, bytes read=39118"
level=info msg="Stop CRI service"
level=info msg="Stop CRI service"
level=info msg="Event monitor stopped"
level=info msg="Stream server stopped"

langesven avatar Dec 02 '25 10:12 langesven

@langesven thanks for the update! Happy to hear settings.container-runtime.concurrent-download-chunk-size=0 at least helps mitigate the issue you are seeing.

Thanks for the detail, that definitely sounds odd, and I would agree with your assessment:

I'm wondering if maybe the size of the image + chunking + 3 pulls in parallel for big images are sort of responsible here?

I will try to poke at this and create a reproduction using several large images in parallel. If I (or we) can find a repro, I can share it with the containerd folks and hopefully find a solution.

It is a bit odd that rescheduling on a different node, or simply restarting, fixes it, but that could just point to a race condition with the combination you mentioned.

KCSesh avatar Dec 02 '25 18:12 KCSesh

I can confirm that the suggested workaround has resolved the issue for us. We primarily had issues when ArgoCD images were pulled; these are also relatively large.

Adding this to the EC2NodeClass:

  userData: |
    [settings.container-runtime]
    concurrent-download-chunk-size=0

sbe-famly avatar Dec 10 '25 08:12 sbe-famly

As linked in https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/764, we are going to unwind the chunk-size default in our next release, changing it from 8 MiB to 0.

We can re-evaluate this decision when containerd 2.2 gets shipped.

Will keep this open until our next release finishes roll out: https://github.com/bottlerocket-os/bottlerocket/issues/4723

KCSesh avatar Dec 15 '25 20:12 KCSesh