gpu-operator Driver Validation Support on Custom Driver Installation Path

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04): COS/Ubuntu
Kernel Version:
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE
GPU Operator Version: Any version

2. Issue or feature description

Driver validation is a prerequisite for the GPU Operator to work properly. Currently, it supports two driver installation locations: the default driver root (/run/nvidia/driver) or the host root (/). The validator checks the driver by chrooting into the driver installation path and running nvidia-smi.

However, the way GKE installs and manages the GPU driver is not compatible with what the GPU Operator assumes:

After driver installation, the Operator asserts that the file /run/nvidia/validations/.driver-ctr-ready exists.
The driver is installed in a custom path (/home/kubernetes/bin/nvidia), so the GPU Operator can't discover the driver's libraries unless it is told the path.

To make driver validation compatible with GKE, below are areas requiring changes:

Support a custom driver/library path in the GPU Operator. When driver.enabled is set to false, the user can set a specific driver installation path, e.g. /home/kubernetes/bin/nvidia, and the GPU Operator then automatically uses this path in its config.
In the validator logic, when a custom driver path is detected, skip the driver-ctr-ready file assertion and change the way nvidia-smi is run.
Support the custom driver path in the GPU Operator components (e.g. the device plugin). The existing device plugin supports a custom root, but the operator doesn't support passing that parameter to the device plugin, container toolkit, and other components.
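To make the first two items concrete, here is a minimal sketch of how a single setting could drive both the effective driver root and the decision to skip the ready-file assertion. The `DriverConfig` type, its fields, and its methods are hypothetical names made up for illustration; they are not the Operator's actual API, and the rule "skip the assertion when a custom path is set" is simply the proposal above written as code.

```go
package main

import "path/filepath"

const (
	// Paths taken from the issue description.
	defaultDriverRoot = "/run/nvidia/driver"
	driverReadyFile   = "/run/nvidia/validations/.driver-ctr-ready"
)

// DriverConfig is a hypothetical settings struct; the field names only
// mirror the options discussed in this thread.
type DriverConfig struct {
	DriverEnabled  bool   // true when the Operator manages the driver itself
	HostDriverRoot string // custom path, e.g. /home/kubernetes/bin/nvidia on GKE
}

// usesCustomRoot reports whether a user-supplied driver path applies.
func (c DriverConfig) usesCustomRoot() bool {
	return !c.DriverEnabled && c.HostDriverRoot != ""
}

// EffectiveDriverRoot returns the path that components would be pointed at.
func (c DriverConfig) EffectiveDriverRoot() string {
	switch {
	case c.usesCustomRoot():
		return filepath.Clean(c.HostDriverRoot)
	case c.DriverEnabled:
		return defaultDriverRoot
	default:
		return "/" // driver preinstalled directly on the host
	}
}

// ReadyFile returns the file the validator should wait for, and false when
// the assertion is skipped because a custom driver path is in use.
func (c DriverConfig) ReadyFile() (string, bool) {
	if c.usesCustomRoot() {
		return "", false
	}
	return driverReadyFile, true
}
```

With `DriverConfig{HostDriverRoot: "/home/kubernetes/bin/nvidia"}`, the effective root is the GKE path and no ready file is awaited; the same value could then be handed to the device plugin and container toolkit, which is the third item in the list.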

3. Steps to reproduce the issue

Deploy the GPU Operator on Ubuntu or COS nodes with the GKE-installed driver; the driver installation check will fail.

Jan 20 '24 00:01 Dragoncell

@Dragoncell thanks for the details here. It makes sense to make hostDriverRoot configurable and to propagate this setting throughout all of our components which depend on it. In fact, there is an open PR for introducing a similar hostRoot option https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960. A natural follow-up to this PR would be to introduce a hostDriverRoot option, as hostRoot may not equal hostDriverRoot (as is the case with GKE).

Jan 24 '24 19:01 cdesiniotis

One aspect that I forgot -- the driver installation folder in COS does not represent a "driver root" in the classical Operator sense since we don't have /dev nodes there and one cannot chroot into it. So the enablement here will be more complex than just adding a new hostDriverRoot option.

Jan 24 '24 19:01 cdesiniotis

/cc

Jan 24 '24 21:01 bobbypage

Hi folks, indeed we are also interested in this. https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960 gets us halfway there, and I imagine it is close to being merged.

Afterwards, I will jump in and create a PR for a hostDriverRoot as well.

Glad to see more interest in this, hopefully it will help things to progress faster.

Also thanks a lot @cdesiniotis for the reviews and suggestions.

Jan 26 '24 07:01 neoaggelos

Thanks for taking a look into the issue cdesiniotis, and thanks for the MR from neoaggelos: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960

As cdesiniotis pointed out, our use case is a little different from the custom hostRoot one, as the driver installed on the host doesn't have /dev nodes and can't be chrooted into. Therefore, in our case the configuration looks like:

hostRoot = /
hostDriverRoot = /home/kubernetes/bin/nvidia

I see the required changes as:

Introduce hostDriverRoot, similar to the hostRoot MR.
Update the validation logic based on hostDriverRoot (whether it has /dev nodes or not):
a) Replace the chroot check if hostDriverRoot can't be chrooted into: https://github.com/NVIDIA/gpu-operator/blob/30bc55d7f2e3419a54eb298dbd09f499c968a659/validator/main.go#L627
b) Under the hostRoot check, use hostDriverRoot instead of /usr/bin: https://github.com/NVIDIA/gpu-operator/blob/30bc55d7f2e3419a54eb298dbd09f499c968a659/validator/main.go#L596
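A rough sketch of what change (a) might look like: decide whether the driver root can be chrooted into, and build the nvidia-smi invocation accordingly. Everything here is an assumption for illustration, not the validator's actual code: the function and type names are hypothetical, "has a /dev directory" is only a heuristic for chrootability, and the `bin/` and `lib64/` subdirectories are an assumed layout of the custom driver path.

```go
package main

import (
	"os"
	"path/filepath"
)

// isChrootable is a heuristic (assumption): treat the driver root as a
// usable chroot target only if it contains a /dev directory.
func isChrootable(driverRoot string) bool {
	info, err := os.Stat(filepath.Join(driverRoot, "dev"))
	return err == nil && info.IsDir()
}

// smiInvocation describes how nvidia-smi would be launched.
type smiInvocation struct {
	Argv []string // command and arguments
	Env  []string // extra environment entries in KEY=VALUE form
}

// buildSMIInvocation picks between the chroot-based check and a direct
// call against the custom path. The bin/ and lib64/ subdirectories are
// an assumed layout, not something confirmed in this thread.
func buildSMIInvocation(driverRoot string, chrootable bool) smiInvocation {
	if chrootable {
		return smiInvocation{Argv: []string{"chroot", driverRoot, "nvidia-smi"}}
	}
	return smiInvocation{
		Argv: []string{filepath.Join(driverRoot, "bin", "nvidia-smi")},
		Env:  []string{"LD_LIBRARY_PATH=" + filepath.Join(driverRoot, "lib64")},
	}
}
```

A caller would combine the two, e.g. `buildSMIInvocation(root, isChrootable(root))`, and pass the result to `exec.Command` with the extra environment appended to `os.Environ()`. Keeping the decision separate from the execution makes the branch easy to test without a GPU.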

Do you prefer to make these changes in one MR or separately from the hostRoot MR? I'd be glad to hear your thoughts.

Feb 13 '24 19:02 Dragoncell

gitlab commit: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1061

Apr 08 '24 18:04 Dragoncell