repoConfig is not mounted into GDS container
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
- Kernel Version: 5.15.x
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
- GPU Operator Version: v23.9.0
2. Issue or feature description
From the code we can see that repoConfig is not mounted into the GDS container, so the apt repository cannot be pointed at an on-premise repository, leaving the container in a CrashLoopBackOff state. The nvidia-fs-ctr container in https://github.com/NVIDIA/gpu-operator/blob/master/manifests/state-driver/0500_daemonset.yaml should contain the following:

```yaml
{{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
{{- range .AdditionalConfigs.VolumeMounts }}
```
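Conceptually, the controller-side fix amounts to appending the same additional volume mounts to the GDS container that the driver container already receives. A minimal Go sketch, assuming illustrative names (`mountAdditionalConfigs` and its parameters are not the operator's actual API):

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// mountAdditionalConfigs appends user-supplied volume mounts (repoConfig,
// certConfig, ...) to the named container of the driver DaemonSet, the same
// way nvidia-driver-ctr already receives them.
func mountAdditionalConfigs(ds *appsv1.DaemonSet, container string, mounts []corev1.VolumeMount) {
	for i := range ds.Spec.Template.Spec.Containers {
		ctr := &ds.Spec.Template.Spec.Containers[i] // pointer into the slice, so the edit persists
		if ctr.Name == container {
			ctr.VolumeMounts = append(ctr.VolumeMounts, mounts...)
		}
	}
}
```

Called with the mounts derived from repoConfig/certConfig and the container name `nvidia-fs-ctr`, this mirrors what the template block above does for the manifest-based path.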
What's more, the GDS image name should have the OS info appended, as is done for the nvidia driver pod. The default values.yaml causes an image pull backoff because the image tag is not correct (missing the OS; it should be 2.16.1-ubuntu20.04):

```yaml
gds:
  version: "2.16.1"
```

From the code, the OS is not used to construct imagePath: https://github.com/NVIDIA/gpu-operator/blob/79fe1cc0923d356d891396498e0cd8f844a711ad/internal/state/driver.go#L533 The driver image path does reference the OS: https://github.com/NVIDIA/gpu-operator/blob/master/internal/state/driver.go#L472
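For illustration, the expected tag construction can be sketched as below; `gdsImagePath` and its parameters are hypothetical names, not the operator's actual helper:

```go
package sketch

import "fmt"

// gdsImagePath builds a fully qualified GDS image reference. The OS tag must
// be appended to the version; a bare "2.16.1" tag does not exist in the
// registry, so the pull backs off.
func gdsImagePath(repository, image, version, osTag string) string {
	return fmt.Sprintf("%s/%s:%s-%s", repository, image, version, osTag)
}
```

For example, `gdsImagePath("nvcr.io/nvidia/cloud-native", "nvidia-fs", "2.16.1", "ubuntu20.04")` (example values) yields a tag ending in `2.16.1-ubuntu20.04`, matching what the driver image path produces.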
3. Steps to reproduce the issue
Enable GDS and the issue is reproduced.
@shivamerla Please help resolve these issues so GDS can be used properly.
@age9990 we are planning to fix this in v23.9.1 (ETA next week). Meanwhile, if you want to try out early bits, use the following:

```
--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging
--set operator.version=master-latest-ubi8
```
@shivamerla Tried v23.9.1 today; repoConfig is still not mounted as an additional volume. I also tried cert-config, and it is not mounted either. As for the GDS image tag, the OS info is correctly appended when the NVIDIADriver CRD is not enabled. However, if I enable the NVIDIADriver CRD, the OS info is not appended, causing image pull backoff.
@age9990 Can you share the pod yaml, describe output and pod logs when you try it with the NVIDIADriver CR?
@tariq1890 Helm values.yaml and driver pod yaml are attached: values.txt driver_pod.txt
Can you share the NVIDIADriver CR yaml? You need to make sure that the repoConfig field is set there, just as in the ClusterPolicy CR.
@tariq1890 repoConfig is present in both the ClusterPolicy CR and the NVIDIADriver CR; as you can see from the driver pod yaml file, it is mounted in the nvidia-driver-ctr and nvidia-peermem-ctr containers. nvidiaDriver_cr.txt
Hey @age9990, thanks for bringing this to our notice. We have confirmed that there is a bug in how the GDS container image names are generated. We will publish the fix in the next planned release.
In the meantime, can you try this image?
```
--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes
--set operator.version=72678615-ubi8
```
@tariq1890 Thanks for fixing the image name issue. What about the repoConfig volumeMounts issue? I'm not familiar with Go, but I see that the code you use to get gdsContainer is different from the other functions: there is no `&` in front of the expression, while the others have one.

```go
gdsContainer := obj.Spec.Template.Spec.Containers[i]
```

https://github.com/NVIDIA/gpu-operator/blob/fd2b1587d5a8a7cd5a3b28afbf2be80d67d0d3d5/controllers/object_controls.go#L2462
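That observation matches Go's semantics: indexing a slice of structs without `&` yields a copy, so volume mounts appended to `gdsContainer` never make it back into the DaemonSet object. A minimal standalone sketch (illustrative, not the operator's code; the `repo-config` mount name is hypothetical):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// addMount shows why the missing '&' matters when mutating a slice element.
func addMount(containers []corev1.Container) {
	byValue := containers[0] // copy of the element: this append is lost
	byValue.VolumeMounts = append(byValue.VolumeMounts, corev1.VolumeMount{Name: "repo-config"})

	byRef := &containers[0] // pointer into the slice: this append persists
	byRef.VolumeMounts = append(byRef.VolumeMounts, corev1.VolumeMount{Name: "repo-config"})
}
```

Taking the element's address, as the other container transformers do, makes the mutation persist in the object that is actually applied.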
Hi @tariq1890, I've seen that the fixes for these issues have been merged to the master branch; can we expect v23.9.2 to be released soon?
Hi @age9990, GPU Operator 24.3.0 has been released and contains a fix for this issue: https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0
I am closing this issue, but please re-open if you are still encountering this with 24.3.0.