Zarf fails when pulling from Nvidia's container registry
Environment
Device and OS: Tested with Ubuntu 22.04 and Windows 11, AMD64
App version: v0.32.6
Steps to reproduce
- Create a zarf.yaml with a component that includes Nvidia images from nvcr.io
- Run `zarf package create --confirm`
This is the simplest zarf.yaml I can reproduce the error with, although it does not fail 100% of the time:
kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
    images:
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
My usual zarf.yaml has the following component, which fails 100% of the time. It seems like the more images there are, the higher the chance it will fail:
- name: gpu-operator
  required: true
  charts:
    - name: gpu-operator
      namespace: gpu-operator
      url: https://helm.ngc.nvidia.com/nvidia
      version: v23.9.2
      valuesFiles:
        - ../k8s/base/gpu-operator-values.yaml
  images:
    - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
    - nvcr.io/nvidia/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
    - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
    - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
    - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
    - nvcr.io/nvidia/gpu-operator:v23.9.2
    - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
For completeness' sake, the contents of `../k8s/base/gpu-operator-values.yaml` are:
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: "/var/lib/rancher/k3s/agent/etc/containerd/config.toml"
    - name: CONTAINERD_SOCKET
      value: "/run/k3s/containerd/containerd.sock"
    - name: CONTAINERD_RUNTIME_CLASS
      value: "nvidia"
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
Expected result
It should pull the images normally and continue with the package creation.
Actual Result
It fails at the `Loading metadata for n images` stage. It retries 2 more times, but that always results in `expected blob size x, but only wrote 23`. This also saves a corrupted cache, so once it fails this way, it will always fail on new runs until you clean the cache.
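Concretely, recovering from a failed run means clearing the image cache before retrying; the commands below are the ones from the repro steps and the warnings in the output further down:

```sh
# Clear the (possibly corrupted) Zarf image cache, as the warnings suggest
zarf tools clear-cache

# Then re-run package creation from the directory containing zarf.yaml
zarf package create --confirm
```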
Visual Proof (screenshots, videos, text, etc)
It writes one of two error messages randomly, but each is about an `INTERNAL_ERROR`.
1:
📦 PACKAGE IMAGES
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Retrying (1/3): Get
"https://ngc.download.nvidia.com/containers/registry//docker/registry/v2/blobs/sha256/52/520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8/data?ak-token=exp=1711620178~acl=/containers/registry/docker/registry/v2/blobs/sha256/52/520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8/data*~hmac=0070c698b83e09d9915563609903e02bd5932dc33b94eaa6fe81605963afb363":
stream error: stream ID 31; INTERNAL_ERROR; received from peer
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Potential image cache corruption: expected blob size 188, but only wrote 23 - try clearing
cache with "zarf tools clear-cache"
WARNING Retrying (2/3): expected blob size 188, but only wrote 23
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Potential image cache corruption: expected blob size 2030, but only wrote 23 - try
clearing cache with "zarf tools clear-cache"
WARNING Retrying (3/3): expected blob size 2030, but only wrote 23
ERROR: Failed to create package: unable to pull images after 3 attempts: expected blob size 2030, but only
wrote 23
2:
📦 PACKAGE IMAGES
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Retrying (1/3) in 5s: stream error: stream ID 89; INTERNAL_ERROR; received from peer
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Retrying (2/3) in 10s: remove
C:\Users\ercan_c11zstp\.zarf-cache\images\sha256-e769c9462d1bfeb130b57c84903eab0c2d8a25298bac4543f04b78adad5414ae:
The process cannot access the file because it is being used by another process.
✔ Loading metadata for 59 images. This step may take a couple of minutes to complete.
WARNING Failed to write image layers, trying again up to 3 times...
WARNING Potential image cache corruption: expected blob size 6884, but only wrote 23 - try
clearing cache with "zarf tools clear-cache"
WARNING Retrying (3/3) in 20s: expected blob size 6884, but only wrote 23
ERROR: Failed to create package: unable to pull images after 3 attempts: expected blob size 6884, but only
wrote 23
Severity/Priority
Very severe, because it completely blocks us from being able to create (and deploy) our package, which needs Nvidia GPU functionality to work.
Additional Context
This has been tested on multiple PCs/servers, on different networks, and over multiple days, so I don't think it could be a temporary hiccup or rate limiting by nvcr.io. Running `docker pull` on the images manually works just fine. Even if Zarf fails while downloading a layer, it should not corrupt the layer cache; it should instead be able to recover at the retry stage.
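For example, pulling one of the images from the component above directly with Docker succeeds every time:

```sh
# The same nvcr.io image that fails inside Zarf pulls fine with Docker
docker pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
```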
@ercanserteli I am not able to reproduce the error you're seeing.
Using this zarf.yaml:
kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
        valuesFiles:
          - ./values.yaml
    images:
      - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
      - nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
      - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
      - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
      - nvcr.io/nvidia/gpu-operator:v23.9.2
      - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
Note that I had to add `k8s` to this image reference: `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04` (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter/tags).
In the output you provided, it says `Loading metadata for 59 images`, but the example zarf.yaml you provided only has 8 images. There seem to be a lot more images defined in your package that Zarf is trying to pull. Could you provide the zarf.yaml that's actually being used, the one with 59 images?
After waiting a few hours and trying again, I am seeing the error now. I suspect this is somehow an issue related to NVIDIA's registry, as we have not seen this problem occur with other registries that I'm aware of.
You are right, the sample outputs I added were from running with the whole production zarf.yaml, which has more images overall, but as you confirmed, the problem occurs even with only this component. I also do not see this problem with any registry other than Nvidia's, but it may be that the occurrence rate increases when there are more images to pull overall. I have a 100% failure rate with the full zarf.yaml over ~50 tries, although I can't share it here because it includes private components. (Of course, there is no error when I exclude the gpu-operator component, so the other images are not to blame for the failure.)
In any case, I believe Zarf should handle failed image layer downloads more gracefully, so that they don't get cached in a corrupted state. If that were fixed, Zarf's retry mechanism could work successfully, and the sporadic `INTERNAL_ERROR` from the registry side would not ruin the pulling process.
Is there any possible workaround for this problem? For example, manually running `docker pull` on the images works, but I do not know if there is a way to make Zarf use the local Docker cache.
I also tried setting up a pull-through cache on AWS ECR, but it seems they don't support Nvidia's registry.
Any ideas on a workaround would be great so that we can create packages in the meantime.
Yes, if Zarf does not find an image, it will pull from the local Docker image store. I'm not sure if Zarf will still fall back to the local Docker store if it sees an image in a remote registry and then fails to pull it. You may have to rename/retag the images, roughly as sketched below.
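Something like this, where the `local/` name is only an example and your zarf.yaml would then reference the retagged name instead of the nvcr.io reference:

```sh
# Pull the image with Docker, which works against nvcr.io
docker pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5

# Retag it under a name Zarf won't find in any remote registry,
# so it falls back to the local Docker image store
docker tag nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5 \
  local/k8s-driver-manager:v0.6.5
```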
Thank you, this worked as a workaround! For anyone with the same problem: I first modified the hosts file to make nvcr.io unreachable so that Zarf used the local Docker images, but that was extremely slow. Instead, setting up a local registry, pushing all the images to it, and using `--registry-override` during `zarf package create` worked like a charm.
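Roughly, the steps looked like this; the registry port and the single image shown are just examples, the tag/push step is repeated for every nvcr.io image in the package, and exact flags may differ depending on your setup:

```sh
# Run a throwaway local registry
docker run -d -p 5000:5000 --name local-registry registry:2

# Pull, retag, and push each nvcr.io image into the local registry
docker pull nvcr.io/nvidia/gpu-operator:v23.9.2
docker tag nvcr.io/nvidia/gpu-operator:v23.9.2 localhost:5000/nvidia/gpu-operator:v23.9.2
docker push localhost:5000/nvidia/gpu-operator:v23.9.2

# Create the package, overriding nvcr.io with the local registry
zarf package create --registry-override nvcr.io=localhost:5000 --confirm
```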
@ercanserteli This issue should be fixed as of v0.34.0. If you are still having issues, feel free to reopen.