gpu-operator
550.90.07-5.15.0-1061-gke-ubuntu22.04 image tag not found when installing with `driver.usePrecompiled` on GKE
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-1061-gke
- Container Runtime Type/Version (e.g. containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE
- GPU Operator Version: 24.6.1
2. Issue or feature description
When the operator is installed with `driver.usePrecompiled: true` on GKE, the `nvidia-driver-daemonset-5.15.0-1061-gke-ubuntu22.04` DaemonSet fails to start because the image tag `550.90.07-5.15.0-1061-gke-ubuntu22.04` cannot be found in the `nvcr.io/nvidia/driver` repository.
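For context, a hypothetical excerpt of the DaemonSet spec the operator generates (the container name is illustrative; the image tag is the one from this issue, which does not exist in the registry, so the pull fails):

```yaml
# Illustrative excerpt of the generated driver DaemonSet pod spec
spec:
  containers:
  - name: nvidia-driver-ctr   # container name is an assumption for illustration
    image: nvcr.io/nvidia/driver:550.90.07-5.15.0-1061-gke-ubuntu22.04
```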
3. Steps to reproduce the issue
- Install the v24.6.1 operator on GKE with the following in your values file:

```yaml
driver:
  enabled: true
  usePrecompiled: true
```
- Observe that the operator spawns a DaemonSet named `nvidia-driver-daemonset-5.15.0-1061-gke-ubuntu22.04`.
- Observe that the image defined in the DaemonSet cannot be pulled.
Yes, I understand per the GKE docs here that drivers must be installed separately, and from here that driver installation must be disabled via the operator, but it seems like there should be a validation check in the operator that prevents installation when an image tag doesn't exist.
@chipzoller there are no precompiled driver packages for the gke kernels which is why we do not have any precompiled container images for Ubuntu 22.04 + gke kernel variant. If you want the GPU Operator to deploy and manage the lifecycle of the driver, you will need to use the non-precompiled images.
Hi @cdesiniotis, yes I get that; I'm just stating with this issue that there isn't any mechanism to prevent users from hitting this situation. My recommendation is some template logic which blocks this condition so the chart fails to deploy, rather than deploying happily only for a component to fail to come up due to an unavailable tag.
@chipzoller the kernel version, and thus the precompiled driver image tag, is not known until runtime. The gpu-operator constructs the image tag from the OS name + kernel version running on the GPU node -- it gets the needed information from node labels added by Node Feature Discovery. I don't believe this is something we can easily validate at the point in time when the chart is installed.
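To illustrate, here is a rough sketch of how that runtime tag is assembled, using the values from this issue. The label names come from Node Feature Discovery; treat the exact assembly as an approximation of the operator's logic, not the actual operator code.

```shell
# Approximation of how the precompiled driver image tag is constructed
# at runtime (illustrative only). Values mirror the node in this issue.
DRIVER_VERSION="550.90.07"
KERNEL_VERSION="5.15.0-1061-gke"   # from feature.node.kubernetes.io/kernel-version.full
OS_RELEASE="ubuntu22.04"           # from the system-os_release ID + VERSION_ID labels

TAG="${DRIVER_VERSION}-${KERNEL_VERSION}-${OS_RELEASE}"
echo "nvcr.io/nvidia/driver:${TAG}"
# prints nvcr.io/nvidia/driver:550.90.07-5.15.0-1061-gke-ubuntu22.04
```

Since no precompiled package exists for GKE kernels, this computed tag has no matching image in the repository.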
You should be able to use the Helm lookup() function to retrieve a node's labels and then fail conditionally. This would have some potential negative implications, however, as some tools don't support lookup(), including some cloud vendor marketplace catalogs if I recall. An alternative could be to fail in the operator container and print the relevant message rather than template a resource with an invalid image tag. If none of those seem like viable options, feel free to close this as not planned. Just throwing some ideas out there that may help others.
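A minimal sketch of such a lookup() guard, assuming NFD's kernel-version label is already present on nodes (note that lookup() returns empty results during `helm template` and `--dry-run`, so this would only guard real installs):

```
{{- /* Hypothetical pre-install guard: abort if usePrecompiled is set
       and any node reports a GKE kernel variant. */}}
{{- if .Values.driver.usePrecompiled }}
  {{- range (lookup "v1" "Node" "" "").items }}
    {{- $kernel := index .metadata.labels "feature.node.kubernetes.io/kernel-version.full" | default "" }}
    {{- if contains "gke" $kernel }}
      {{- fail (printf "no precompiled driver image exists for kernel %s" $kernel) }}
    {{- end }}
  {{- end }}
{{- end }}
```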
> An alternative could be to fail in the operator container and print the relevant message rather than template a resource with an invalid image tag.
This seems like the most reasonable option if we wanted to fail earlier.
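As a rough illustration of failing earlier, a pre-flight check along these lines (hypothetical, not current operator behavior; assumes `skopeo` is available in the operator image) could probe the registry for the computed tag before the DaemonSet is created:

```shell
# Hypothetical pre-flight check, not current operator behavior: probe the
# registry for the computed tag and emit an actionable error if it is missing.
IMAGE="nvcr.io/nvidia/driver:550.90.07-5.15.0-1061-gke-ubuntu22.04"

check_image() {
  # skopeo inspect exits non-zero when the tag does not exist
  if skopeo inspect "docker://$1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "image $1 not found; use non-precompiled images or install drivers per the GKE docs"
  fi
}
```

A failing check could then surface a clear message in the operator logs instead of leaving a DaemonSet stuck in ImagePullBackOff.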
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.
This issue has been open for over 90 days without recent updates, and the context may now be outdated.
Given that gpu-operator 24.6.1 is now EOL, I would encourage you to try the latest version and see if you still hit this issue.
If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.