Zarf fails when pulling from Nvidia's container registry
Environment
Device and OS: Tested with Ubuntu 22.04 and Windows 11, AMD64
App version: v0.32.6
Steps to reproduce
- Create a zarf.yaml with a component that includes Nvidia images from nvcr.io
- Run `zarf package create --confirm`
This is the simplest zarf.yaml I can reproduce the error with, although it does not fail 100% of the time:
kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
    images:
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
My usual zarf.yaml has the following component, which fails 100% of the time. It seems like the more images there are, the higher the chance it will fail:
- name: gpu-operator
  required: true
  charts:
    - name: gpu-operator
      namespace: gpu-operator
      url: https://helm.ngc.nvidia.com/nvidia
      version: v23.9.2
      valuesFiles:
        - ../k8s/base/gpu-operator-values.yaml
  images:
    - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
    - nvcr.io/nvidia/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
    - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
    - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
    - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
    - nvcr.io/nvidia/gpu-operator:v23.9.2
    - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
For completeness' sake, the contents of `../k8s/base/gpu-operator-values.yaml` are:
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: "/var/lib/rancher/k3s/agent/etc/containerd/config.toml"
    - name: CONTAINERD_SOCKET
      value: "/run/k3s/containerd/containerd.sock"
    - name: CONTAINERD_RUNTIME_CLASS
      value: "nvidia"
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
Expected result
It should pull the images normally and continue with the package creation.
Actual Result
It fails at the `Loading metadata for n images` stage. It retries 2 more times, but that always results in `expected blob size x, but only wrote 23`. This also saves a corrupted cache, so once it fails this way, it will always fail on new runs until you clean the cache.
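Concretely, recovering from a failed run means clearing the image cache before retrying; the commands below are the ones from the repro steps and the warnings in the output further down:

```sh
# Clear the (possibly corrupted) Zarf image cache, as the warnings suggest
zarf tools clear-cache

# Then re-run package creation from the directory containing zarf.yaml
zarf package create --confirm
```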
Visual Proof (screenshots, videos, text, etc)
It writes one of two error messages randomly, but each is about an `INTERNAL_ERROR`.
1:
📦 PACKAGE IMAGES
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Retrying (1/3): Get
"https://ngc.download.nvidia.com/containers/registry//docker/registry/v2/blobs/sha256/52/520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8/data?ak-token=exp=1711620178~acl=/containers/registry/docker/registry/v2/blobs/sha256/52/520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8/data*~hmac=0070c698b83e09d9915563609903e02bd5932dc33b94eaa6fe81605963afb363":
stream error: stream ID 31; INTERNAL_ERROR; received from peer
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Potential image cache corruption: expected blob size 188, but only wrote 23 - try clearing
cache with "zarf tools clear-cache"
WARNING Retrying (2/3): expected blob size 188, but only wrote 23
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Potential image cache corruption: expected blob size 2030, but only wrote 23 - try
clearing cache with "zarf tools clear-cache"
WARNING Retrying (3/3): expected blob size 2030, but only wrote 23
ERROR: Failed to create package: unable to pull images after 3 attempts: expected blob size 2030, but only
wrote 23
2:
📦 PACKAGE IMAGES
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Retrying (1/3) in 5s: stream error: stream ID 89; INTERNAL_ERROR; received from peer
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Retrying (2/3) in 10s: remove
C:\Users\ercan_c11zstp\.zarf-cache\images\sha256-e769c9462d1bfeb130b57c84903eab0c2d8a25298bac4543f04b78adad5414ae:
The process cannot access the file because it is being used by another process.
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Potential image cache corruption: expected blob size 6884, but only wrote 23 - try
clearing cache with "zarf tools clear-cache"
WARNING Retrying (3/3) in 20s: expected blob size 6884, but only wrote 23
ERROR: Failed to create package: unable to pull images after 3 attempts: expected blob size 6884, but only
wrote 23
Severity/Priority
Very severe, because it completely blocks us from being able to create (and deploy) our package, which needs Nvidia GPU functionality to work.
Additional Context
This has been tested on multiple PCs/servers, on different networks, and over multiple days, so I don't think it could be a temporary hiccup or rate limiting by nvcr.io. Running `docker pull` on the images manually works just fine. Even if Zarf fails while downloading a layer, it should not corrupt the layer cache; it should instead be able to recover at the retry stage.
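For example, pulling one of the images from the component above directly with Docker succeeds every time:

```sh
# The same nvcr.io image that fails inside Zarf pulls fine with Docker
docker pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
```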
@ercanserteli I am not able to reproduce the error you're seeing.
Using this zarf.yaml:
kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
        valuesFiles:
          - ./values.yaml
    images:
      - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
      - nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
      - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
      - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
      - nvcr.io/nvidia/gpu-operator:v23.9.2
      - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
Note that I had to add `k8s` to this image reference: `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04` (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter/tags).
In the output you provided, it says `Loading metadata for 59 images`, but the example zarf.yaml you provided only has 8 images. There seem to be a lot more images defined in your package that Zarf is trying to pull. Could you provide the zarf.yaml that's actually being used, the one with 59 images?
After waiting a few hours and trying again, I am seeing the error now. I suspect this is somehow an issue related to NVIDIA's registry, as we have not seen this problem occur with other registries that I'm aware of.
You are right, the sample outputs I added were from running with the whole production zarf.yaml, which has more images overall, but as you confirmed, the problem occurs even with only this component. I also do not see this problem with any registry other than Nvidia's, but it may be that the occurrence rate increases when there are more images to pull overall. I have a 100% failure rate with the full zarf.yaml over ~50 tries, although I can't share it here because it includes private components. (Of course, there is no error when I exclude the gpu-operator component, so the other images are not to blame for the failure.)
In any case, I believe Zarf should handle failed image layer downloads more gracefully, so that they don't get cached in a corrupted state. If that were fixed, Zarf's retry mechanism could work successfully, and the sporadic `INTERNAL_ERROR` from the registry side would not ruin the pulling process.
Is there any possible workaround for this problem? For example, manually running `docker pull` on the images works, but I do not know if there is a way to make Zarf use the local Docker cache.
I also tried setting up a pull-through cache on AWS ECR, but it seems they don't support Nvidia's registry.
Any ideas on a workaround would be great so that we can create packages in the meantime.
Yes, if Zarf does not find an image, it will pull from the local Docker image store. I'm not sure if Zarf will still fall back to the local Docker store if it sees an image in a remote registry and then fails to pull it. You may have to rename/retag the images, roughly as sketched below.
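Something like this, where the `local/` name is only an example and your zarf.yaml would then reference the retagged name instead of the nvcr.io reference:

```sh
# Pull the image with Docker, which works against nvcr.io
docker pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5

# Retag it under a name Zarf won't find in any remote registry,
# so it falls back to the local Docker image store
docker tag nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5 \
  local/k8s-driver-manager:v0.6.5
```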
Thank you, this worked as a workaround! For anyone with the same problem: I first modified the hosts file to make nvcr.io unreachable so that Zarf used the local Docker images, but that was extremely slow. Instead, setting up a local registry, pushing all the images to it, and using `--registry-override` during `zarf package create` worked like a charm.
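Roughly, the steps looked like this; the registry port and the single image shown are just examples, the tag/push step is repeated for every nvcr.io image in the package, and exact flags may differ depending on your setup:

```sh
# Run a throwaway local registry
docker run -d -p 5000:5000 --name local-registry registry:2

# Pull, retag, and push each nvcr.io image into the local registry
docker pull nvcr.io/nvidia/gpu-operator:v23.9.2
docker tag nvcr.io/nvidia/gpu-operator:v23.9.2 localhost:5000/nvidia/gpu-operator:v23.9.2
docker push localhost:5000/nvidia/gpu-operator:v23.9.2

# Create the package, overriding nvcr.io with the local registry
zarf package create --registry-override nvcr.io=localhost:5000 --confirm
```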
@ercanserteli This issue should be fixed as of v0.34.0. If you are still having issues, feel free to reopen.