Driver Validation Support on Custom Driver Installation Path
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): COS/Ubuntu
- Kernel Version:
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE
- GPU Operator Version: Any version
2. Issue or feature description
Driver validation is a prerequisite for the GPU Operator to work properly. Currently, it supports two driver installation locations: the default driver root (/run/nvidia/driver) or the host root (/). The validator checks the driver by chrooting into the driver installation path and running nvidia-smi.
However, the way GKE installs and manages the GPU driver is not compatible with what the GPU Operator assumes:
- After driver installation, the Operator assumes that the file /run/nvidia/validations/.driver-ctr-ready exists. [assertion]
- The driver is installed in a custom path (/home/kubernetes/bin/nvidia), so the GPU Operator can't discover the driver's libraries unless it is told the path.
To make driver validation compatible with GKE, the following areas require changes:
- Support a custom driver/library path in the GPU Operator. When driver enable is set to False, the user can set a specific driver installation path, e.g. /home/kubernetes/bin/nvidia, and the GPU Operator then uses this path automatically in its configuration.
- Within the validator code logic, when a custom driver path is detected, it could skip the driver-ctr-ready file assertion. In addition, change the way it runs nvidia-smi.
- Support a custom driver path within GPU Operator components (e.g. the device plugin). The existing device plugin supports a custom root, but the Operator doesn't support passing this parameter to the device plugin, container toolkit, and other components.
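To make the first two points concrete, here is a minimal sketch of how the driver root could be resolved and how the ready-file assertion could be skipped. The helper `ResolveDriverRoot` and its parameters are hypothetical names for illustration and are not existing GPU Operator code; how the Operator would actually expose the custom path is an open design question.

```go
package main

import (
	"fmt"
	"path/filepath"
)

const (
	containerDriverRoot = "/run/nvidia/driver" // default root for an operator-managed driver
	hostRoot            = "/"                  // driver preinstalled directly on the host
)

// ResolveDriverRoot is a hypothetical helper (not existing operator code).
// It returns the driver root to validate and whether that root is a
// user-supplied custom path. In this sketch, a custom path is only honoured
// when the operator-managed driver is disabled.
func ResolveDriverRoot(driverEnabled bool, customDir string) (root string, isCustom bool) {
	switch {
	case driverEnabled:
		return containerDriverRoot, false
	case customDir != "":
		return filepath.Clean(customDir), true
	default:
		return hostRoot, false
	}
}

func main() {
	root, custom := ResolveDriverRoot(false, "/home/kubernetes/bin/nvidia")
	// Assumption: with a custom path, the .driver-ctr-ready assertion is skipped.
	fmt.Printf("driver root=%s skipReadyFileAssertion=%t\n", root, custom)
}
```

The `isCustom` flag is what the validator would consult before asserting on /run/nvidia/validations/.driver-ctr-ready.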
3. Steps to reproduce the issue
Deploy the GPU Operator on Ubuntu or COS nodes with the GKE-installed driver; the driver installation check will fail.
@Dragoncell thanks for the details here. It makes sense to make hostDriverRoot configurable and to propagate this setting throughout all of our components which depend on it. In fact, there is an open PR for introducing a similar hostRoot option https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960. A natural follow-up to this PR would be to introduce a hostDriverRoot option, as hostRoot may not equal hostDriverRoot (as is the case with GKE).
One aspect that I forgot -- the driver installation folder in COS does not represent a "driver root" in the classical Operator sense since we don't have /dev nodes there and one cannot chroot into it. So the enablement here will be more complex than just adding a new hostDriverRoot option.
/cc
Hi folks, indeed we are also interested in this. https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960 gets us halfway there, and I imagine it is close to being merged.
Afterwards, I will jump in and create a PR for a hostDriverRoot as well.
Glad to see more interest in this, hopefully it will help things to progress faster.
Also thanks a lot @cdesiniotis for the reviews and suggestions.
Thanks for taking a look into the issue cdesiniotis, and thanks for the MR from neoaggelos: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960
As pointed out by cdesiniotis, our use case is a little different from the custom hostRoot case, as the driver installed on the host doesn't have /dev nodes and can't be chrooted into. Therefore, in our case the configuration looks like:
hostRoot = /
hostDriverRoot = /home/kubernetes/bin/nvidia
I see the required changes as:
- Introduce hostDriverRoot, similar to the hostRoot MR
- Update the validation logic based on the hostDriverRoot (whether it has /dev nodes or not):
  a) Replace the chroot check if the hostDriverRoot can't be chrooted into: https://github.com/NVIDIA/gpu-operator/blob/30bc55d7f2e3419a54eb298dbd09f499c968a659/validator/main.go#L627
  b) Under the hostRoot check, use the hostDriverRoot instead of /usr/bin: https://github.com/NVIDIA/gpu-operator/blob/30bc55d7f2e3419a54eb298dbd09f499c968a659/validator/main.go#L596
Would you prefer these changes in the same MR as hostRoot or in a separate one? I'd be glad to hear your thoughts.
gitlab commit: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1061