
repoConfig is not mounted into GDS container

Open age9990 opened this issue 1 year ago • 9 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.15.x
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.0

2. Issue or feature description

  1. From the code we can see that repoConfig is not mounted into the GDS container, so the apt repository cannot be pointed at an on-premise repository, leaving the container in a CrashLoopBackOff state. The nvidia-fs-ctr container in https://github.com/NVIDIA/gpu-operator/blob/master/manifests/state-driver/0500_daemonset.yaml should contain the following:

       {{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
       {{- range .AdditionalConfigs.VolumeMounts }}

  2. What's more, the GDS image name should have the OS info appended, as is done for the NVIDIA driver pod. With the default values.yaml, the image pull backs off because the tag is incomplete (missing the OS; it should be 2.16.1-ubuntu20.04):

       gds:
         version: "2.16.1"

     From the code, the OS is not used to construct imagePath: https://github.com/NVIDIA/gpu-operator/blob/79fe1cc0923d356d891396498e0cd8f844a711ad/internal/state/driver.go#L533 The driver image path does reference the OS: https://github.com/NVIDIA/gpu-operator/blob/master/internal/state/driver.go#L472
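The tag convention being requested can be sketched in Go; the helper and the repository/image names below are illustrative stand-ins, not the operator's actual code:

```go
package main

import "fmt"

// imagePath sketches how a GDS image reference should be built:
// the OS tag is appended after the version, as the driver image does.
func imagePath(repo, image, version, osTag string) string {
	return fmt.Sprintf("%s/%s:%s-%s", repo, image, version, osTag)
}

func main() {
	// Without the "-ubuntu20.04" suffix the tag does not exist in the
	// registry, which is what produces the ImagePullBackOff.
	fmt.Println(imagePath("nvcr.io/nvidia/cloud-native", "nvidia-fs", "2.16.1", "ubuntu20.04"))
	// → nvcr.io/nvidia/cloud-native/nvidia-fs:2.16.1-ubuntu20.04
}
```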

3. Steps to reproduce the issue

Enable gds then the issue is reproduced.

@shivamerla Please help to resolve these issues to use GDS properly.

age9990 avatar Nov 13 '23 16:11 age9990

@age9990 we are planning to fix this in v23.9.1 (ETA next week). Meanwhile, if you want to try out early bits, use the following:

--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging
--set operator.version=master-latest-ubi8

shivamerla avatar Dec 01 '23 01:12 shivamerla

@shivamerla Tried v23.9.1 today; repoConfig is still not mounted as an additional volume. I also tried cert-config, and it is not mounted either. As for the GDS image tag, it correctly appends the OS info when the NVIDIADriver CRD is not enabled. However, if I enable the NVIDIADriver CRD, the OS info is not appended, causing an image pull backoff.

age9990 avatar Dec 11 '23 15:12 age9990

@age9990 Can you share the pod yaml, describe output and pod logs when you try it with the NVIDIADriver CR?

tariq1890 avatar Dec 11 '23 23:12 tariq1890

@tariq1890 Helm values.yaml and driver pod yaml attached. values.txt driver_pod.txt

age9990 avatar Dec 12 '23 12:12 age9990

Can you share the NVIDIADriver CR yaml? You need to make sure the repoConfig field is set there, just as in the ClusterPolicy CR.

tariq1890 avatar Dec 12 '23 21:12 tariq1890

@tariq1890 repoConfig is present in both the ClusterPolicy CR and the NVIDIADriver CR; as you can see from the driver pod yaml, it is mounted in the nvidia-driver-ctr and nvidia-peermem-ctr containers. nvidiaDriver_cr.txt

age9990 avatar Dec 13 '23 11:12 age9990

Hey @age9990, thanks for bringing this to our notice. We have confirmed that there is a bug in how the GDS container image names are generated. We will publish the fix in the next planned release.

In the meantime, can you try this image?

--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes
--set operator.version=72678615-ubi8

tariq1890 avatar Dec 15 '23 22:12 tariq1890

@tariq1890 Thanks for fixing the image name issue. What about the repoConfig volumeMounts issue? I'm not familiar with Go, but I see that the code you use to get gdsContainer differs from the other functions: there is no '&' in front of the expression, while the others take a pointer. gdsContainer := obj.Spec.Template.Spec.Containers[i] https://github.com/NVIDIA/gpu-operator/blob/fd2b1587d5a8a7cd5a3b28afbf2be80d67d0d3d5/controllers/object_controls.go#L2462
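The suspected bug can be reproduced with a minimal Go sketch (the types here are illustrative stand-ins for the Kubernetes ones): indexing a slice of structs without '&' yields a copy, so volume mounts appended to that copy never reach the object that is actually deployed.

```go
package main

import "fmt"

// Simplified stand-ins for corev1.Container and corev1.PodSpec.
type Container struct {
	Name         string
	VolumeMounts []string
}

type PodSpec struct {
	Containers []Container
}

func main() {
	spec := PodSpec{Containers: []Container{{Name: "nvidia-fs-ctr"}}}

	// Indexing without '&' copies the struct; the append mutates the copy only.
	copyCtr := spec.Containers[0]
	copyCtr.VolumeMounts = append(copyCtr.VolumeMounts, "repo-config")
	fmt.Println(len(spec.Containers[0].VolumeMounts)) // 0: original unchanged

	// Taking a pointer mutates the slice element in place.
	ptrCtr := &spec.Containers[0]
	ptrCtr.VolumeMounts = append(ptrCtr.VolumeMounts, "repo-config")
	fmt.Println(len(spec.Containers[0].VolumeMounts)) // 1: mount applied
}
```

This is consistent with the symptom reported: the driver and peermem containers (handled via a pointer) get the mount, while the GDS container does not.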

age9990 avatar Dec 16 '23 02:12 age9990

Hi @tariq1890, I've seen the fixes for these issues merged to the master branch; can we expect v23.9.2 to be released soon?

age9990 avatar Jan 04 '24 14:01 age9990

Hi @age9990 GPU Operator 24.3.0 has been released and contains a fix for this issue. https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open if you are still encountering this with 24.3.0.

cdesiniotis avatar May 02 '24 20:05 cdesiniotis