
repoConfig is not mounted into GDS container

Open age9990 opened this issue 1 year ago • 9 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.15.x
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.0

2. Issue or feature description

  1. From the code we can see that repoConfig is not mounted into the GDS container, so the apt repository cannot be pointed at an on-premise repository, leaving the container in a CrashLoopBackOff state. The nvidia-fs-ctr container in https://github.com/NVIDIA/gpu-operator/blob/master/manifests/state-driver/0500_daemonset.yaml should contain the following:

       {{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
       {{- range .AdditionalConfigs.VolumeMounts }}

  2. What's more, the GDS image name should have the OS info appended, as is done for the NVIDIA driver pod. With the default values.yaml, the image pull backs off because the tag is incomplete (missing the OS; it should be 2.16.1-ubuntu20.04):

       gds:
         version: "2.16.1"

     From the code, the OS is not used to construct imagePath: https://github.com/NVIDIA/gpu-operator/blob/79fe1cc0923d356d891396498e0cd8f844a711ad/internal/state/driver.go#L533 The driver image path does reference the OS: https://github.com/NVIDIA/gpu-operator/blob/master/internal/state/driver.go#L472
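The tag convention being requested can be sketched in Go; the helper and the repository/image names below are illustrative stand-ins, not the operator's actual code:

```go
package main

import "fmt"

// imagePath sketches how a GDS image reference should be built:
// the OS tag is appended after the version, as the driver image does.
func imagePath(repo, image, version, osTag string) string {
	return fmt.Sprintf("%s/%s:%s-%s", repo, image, version, osTag)
}

func main() {
	// Without the "-ubuntu20.04" suffix the tag does not exist in the
	// registry, which is what produces the ImagePullBackOff.
	fmt.Println(imagePath("nvcr.io/nvidia/cloud-native", "nvidia-fs", "2.16.1", "ubuntu20.04"))
	// → nvcr.io/nvidia/cloud-native/nvidia-fs:2.16.1-ubuntu20.04
}
```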

3. Steps to reproduce the issue

Enable gds then the issue is reproduced.

@shivamerla Please help to resolve these issues to use GDS properly.

age9990 avatar Nov 13 '23 16:11 age9990

@age9990 we are planning to fix this in v23.9.1 (ETA next week). Meanwhile, if you want to try out early bits, use the following:

--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging
--set operator.version=master-latest-ubi8

shivamerla avatar Dec 01 '23 01:12 shivamerla

@shivamerla Tried v23.9.1 today; repoConfig is still not mounted as an additional volume. I also tried cert-config, and it is not mounted either. As for the GDS image tag, it correctly appends the OS info when the NVIDIADriver CRD is not enabled. However, if I enable the NVIDIADriver CRD, the OS info is not appended, causing an image pull backoff.

age9990 avatar Dec 11 '23 15:12 age9990

@age9990 Can you share the pod yaml, describe output and pod logs when you try it with the NVIDIADriver CR?

tariq1890 avatar Dec 11 '23 23:12 tariq1890

@tariq1890 Helm values.yaml and driver pod yaml attached. values.txt driver_pod.txt

age9990 avatar Dec 12 '23 12:12 age9990

Can you share the NVIDIADriver CR yaml? You need to make sure the repoConfig field is set there, just as in the ClusterPolicy CR.

tariq1890 avatar Dec 12 '23 21:12 tariq1890

@tariq1890 repoConfig is present in both the ClusterPolicy CR and the NVIDIADriver CR; as you can see from the driver pod yaml, it is mounted in the nvidia-driver-ctr and nvidia-peermem-ctr containers. nvidiaDriver_cr.txt

age9990 avatar Dec 13 '23 11:12 age9990

Hey @age9990, thanks for bringing this to our notice. We have confirmed that there is a bug in how the GDS container image names are generated. We will publish the fix in the next planned release.

In the meantime, can you try this image?

--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes
--set operator.version=72678615-ubi8

tariq1890 avatar Dec 15 '23 22:12 tariq1890

@tariq1890 Thanks for fixing the image name issue. What about the repoConfig volumeMounts issue? I'm not familiar with Go, but I see that the code you use to get gdsContainer differs from the other functions: there is no '&' in front of the expression, while the others take a pointer. gdsContainer := obj.Spec.Template.Spec.Containers[i] https://github.com/NVIDIA/gpu-operator/blob/fd2b1587d5a8a7cd5a3b28afbf2be80d67d0d3d5/controllers/object_controls.go#L2462
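The suspected bug can be reproduced with a minimal Go sketch (the types here are illustrative stand-ins for the Kubernetes ones): indexing a slice of structs without '&' yields a copy, so volume mounts appended to that copy never reach the object that is actually deployed.

```go
package main

import "fmt"

// Simplified stand-ins for corev1.Container and corev1.PodSpec.
type Container struct {
	Name         string
	VolumeMounts []string
}

type PodSpec struct {
	Containers []Container
}

func main() {
	spec := PodSpec{Containers: []Container{{Name: "nvidia-fs-ctr"}}}

	// Indexing without '&' copies the struct; the append mutates the copy only.
	copyCtr := spec.Containers[0]
	copyCtr.VolumeMounts = append(copyCtr.VolumeMounts, "repo-config")
	fmt.Println(len(spec.Containers[0].VolumeMounts)) // 0: original unchanged

	// Taking a pointer mutates the slice element in place.
	ptrCtr := &spec.Containers[0]
	ptrCtr.VolumeMounts = append(ptrCtr.VolumeMounts, "repo-config")
	fmt.Println(len(spec.Containers[0].VolumeMounts)) // 1: mount applied
}
```

This is consistent with the symptom reported: the driver and peermem containers (handled via a pointer) get the mount, while the GDS container does not.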

age9990 avatar Dec 16 '23 02:12 age9990

Hi @tariq1890, I've seen the fixes for these issues merged to the master branch; can we expect v23.9.2 to be released soon?

age9990 avatar Jan 04 '24 14:01 age9990

Hi @age9990 GPU Operator 24.3.0 has been released and contains a fix for this issue. https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open if you are still encountering this with 24.3.0.

cdesiniotis avatar May 02 '24 20:05 cdesiniotis