
vgpu-driver-manager doesn't work on air-gapped OpenShift due to lspci requirement

Open NoOverflow opened this issue 3 weeks ago • 2 comments

Hello,

When installing the vgpu-driver-manager, I noticed that the OpenShift Driver Toolkit container tries to download the lspci package at runtime, which is a hard requirement for vGPU setup.

Here's the concerned line for RHEL9 based vgpu manager: https://github.com/NVIDIA/gpu-driver-container/blob/9999eb06e44aa75d88587074adddd42bde61404c/vgpu-manager/rhel9/ocp_dtk_entrypoint#L117

Unfortunately, this causes an issue when the container is air-gapped or reaches the internet through a proxy, since the package installation fails.

To be able to continue, I had to patch the daemonset manually to add two things:

  • The corresponding proxy variables (HTTPS_PROXY...) since the env overrides in the ClusterPolicy object are only applied to the nvidia-vgpu-manager-ctr container and not the openshift-driver-toolkit-ctr

https://github.com/NVIDIA/gpu-operator/blob/4011723a584b9a306fc7bb2368961c053ef283cd/controllers/object_controls.go#L3043

  • A volume containing our CA certificate bundle and an associated volume mount, since some proxies use TLS interception and require additional CAs.

From my point of view, this could be fixed quite easily using one of two ways:

  • Dynamically. Change the transformOpenShiftDriverToolkitContainer function to also override the container environment variables, just like transformVGPUManagerContainer does, and add another field to the ClusterPolicy object to allow arbitrary volume mounts.
  • Statically. Make these changes directly during the image build instead, by modifying the Dockerfiles and Makefile in https://github.com/NVIDIA/gpu-driver-container and adding an option to specify a proxy and CA volume override.
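
To illustrate the dynamic option, here is a minimal sketch in Go of applying the same env overrides and an extra CA volume mount to both containers. It is not the operator's actual code: the types are simplified stand-ins for the Kubernetes container types, and applyOverrides, the proxy URL, the volume name and the mount path are all hypothetical.

```go
package main

import "fmt"

// EnvVar, VolumeMount and Container are simplified stand-ins for the
// Kubernetes API types (assumption: only the fields needed here are modelled).
type EnvVar struct {
	Name  string
	Value string
}

type VolumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

type Container struct {
	Name         string
	Env          []EnvVar
	VolumeMounts []VolumeMount
}

// setEnv overrides an existing variable with the same name, or appends it.
func setEnv(c *Container, name, value string) {
	for i := range c.Env {
		if c.Env[i].Name == name {
			c.Env[i].Value = value
			return
		}
	}
	c.Env = append(c.Env, EnvVar{Name: name, Value: value})
}

// applyOverrides is a hypothetical helper: it applies the env overrides and
// extra volume mounts to every container whose name is listed in targets.
func applyOverrides(containers []Container, targets map[string]bool, env []EnvVar, mounts []VolumeMount) {
	for i := range containers {
		if !targets[containers[i].Name] {
			continue
		}
		for _, e := range env {
			setEnv(&containers[i], e.Name, e.Value)
		}
		containers[i].VolumeMounts = append(containers[i].VolumeMounts, mounts...)
	}
}

func main() {
	containers := []Container{
		{Name: "nvidia-vgpu-manager-ctr"},
		{Name: "openshift-driver-toolkit-ctr"},
	}
	// Both containers receive the overrides, not only the vGPU manager one.
	targets := map[string]bool{
		"nvidia-vgpu-manager-ctr":      true,
		"openshift-driver-toolkit-ctr": true,
	}
	// Illustrative values only; real ones would come from the ClusterPolicy.
	env := []EnvVar{{Name: "HTTPS_PROXY", Value: "http://proxy.example.internal:3128"}}
	mounts := []VolumeMount{{Name: "custom-ca-bundle", MountPath: "/etc/pki/ca-trust/extracted/pem", ReadOnly: true}}

	applyOverrides(containers, targets, env, mounts)
	for _, c := range containers {
		fmt.Printf("%s: env=%v mounts=%v\n", c.Name, c.Env, c.VolumeMounts)
	}
}
```

The extra volume mount would also need a matching volume on the pod spec, which this sketch leaves out.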

Let me know what you think, good day 😄

NoOverflow, Dec 18 '25 17:12

Hi @NoOverflow, we now include lspci in our vgpu-manager container image so that the DTK container does not need to install it at runtime. See https://github.com/NVIDIA/gpu-driver-container/commit/81979dbf4d395e184ac3796562e8371c89a2be7d and https://github.com/NVIDIA/gpu-driver-container/commit/9ee349b36b7611c94304aea56fe33054fb1ed149. Would you be able to rebuild your vgpu-manager container image from top-of-tree and see if that resolves the issue?

cdesiniotis, Dec 18 '25 17:12

Hi @cdesiniotis, that's far better; it was weird to see packages installed at runtime. I'll try this out tomorrow, thanks for the heads-up! I will close the issue accordingly.

NoOverflow, Dec 18 '25 18:12