gpu-operator 1.9.0 chart issue with repo-config in air-gapped env

I hit an issue when configuring a local repo ConfigMap for an air-gapped environment with the 1.9.0 operator chart. The default CentOS and CUDA repos are still configured and used in /etc/yum.repos.d/ inside the nvidia-driver-daemonset pod:

...
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Repository base is listed more than once in the configuration
Repository updates is listed more than once in the configuration
Repository extras is listed more than once in the configuration
Repository centosplus is listed more than once in the configuration
Repository cuda is listed more than once in the configuration
Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
14: curl#7 - "Failed to connect to 2001:4178:5:200::10: Network is unreachable"
...
Cannot find a valid baseurl for repo: base/7/x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

Dec 17 '21 22:12 yug0slav

Hi @yug0slav

Can you try naming your repo configuration files CentOS-Vault.repo and cuda.repo and retry this? Naming them as such will replace the existing repo configuration files in the driver image at /etc/yum.repos.d/, which will avoid this error.

apt will raise a warning for repos it cannot reach, but it will not fail. yum appears to behave differently.

Jan 26 '22 00:01 cdesiniotis

The cert-config ConfigMap described in the air-gapped docs isn't working:
- It looks like the certs are being dropped into /etc/pki/ca-trust/extracted/pem/ but not actually installed.
- Should they be dropped into /etc/pki/ca-trust/sources/anchors/ and followed by update-ca-trust instead?
I switched to HTTP and changed repo-config to make it work:

apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator
data:
  CentOS-Base.repo: |
    [base]
    name=CentOS-7 - Base
    ...

  CentOS-Updates.repo: |
    [updates]
    name=CentOS-7 - Updates
    ...

  CentOS-Extras.repo: |
    [extras]
    name=CentOS-7 - Extras
    ...

  CentOS-Plus.repo: |
    [centosplus]
    name=CentOS-7 - Plus
    ...
  
  EPEL.repo: |
    [epel]
    name=CentOS-7 - EPEL
    ...
  
  cuda.repo: |
    [cuda]
    name=cuda
    ...
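
A ConfigMap shaped like the one above can also be generated from local files rather than written by hand. The following is only a sketch: the helper name `build_repo_configmap` and the mirror URL are hypothetical, and the name and namespace defaults are taken from the manifest above. It emits JSON, which `kubectl apply -f` accepts in the same way as YAML, and it rejects keys that do not end in `.repo` because yum only reads files with that suffix.

```python
import json


def build_repo_configmap(repo_files, name="repo-config", namespace="gpu-operator"):
    """Build a ConfigMap manifest whose keys are .repo file names.

    repo_files maps a file name (for example "cuda.repo") to the file's text.
    """
    for filename in repo_files:
        # yum only picks up files ending in ".repo" from its repo directory.
        if not filename.endswith(".repo"):
            raise ValueError(f"not a .repo file name: {filename!r}")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": dict(repo_files),
    }


if __name__ == "__main__":
    # Hypothetical mirror URL, for illustration only.
    cuda_repo = (
        "[cuda]\n"
        "name=cuda\n"
        "baseurl=http://mirror.example.internal/cuda/rhel7/x86_64\n"
        "enabled=1\n"
        "gpgcheck=0\n"
    )
    manifest = build_repo_configmap({"cuda.repo": cuda_repo})
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`.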

Feb 01 '22 16:02 yug0slav

Thanks for more details. If I am understanding you correctly, there are two issues. Correct me if I am wrong.

1. On CentOS 7, you have to name your repo config files exactly the same as the original ones you are trying to replace (e.g. CentOS-Base.repo, cuda.repo). yum will complain about duplicate repo entries otherwise.
2. Mounting custom keys/certificates through the cert-config ConfigMap is not functional.

For 1, we will improve our documentation.

For 2, we will need to bring up a proper "air-gapped" environment for CentOS 7 and get back to you. Are you able to use HTTP for now and successfully install GPU Operator 1.9 in your air-gapped environment?
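
The duplicate-repo problem from the first point can be checked before deploying. The sketch below assumes that a mounted file replaces an image file of the same name and that yum reports a repo ID declared in more than one file. The file contents used as `image` are a hypothetical stand-in for what the driver image ships, not a dump of the real image, and all function names are illustrative.

```python
import re
from collections import defaultdict

# Matches INI-style section headers such as "[base]" on their own line.
SECTION_RE = re.compile(r"^\s*\[([^\]]+)\]\s*$", re.MULTILINE)


def repo_ids(repo_file_text):
    """Return the repo IDs (section names) declared in one .repo file."""
    return SECTION_RE.findall(repo_file_text)


def effective_files(image_files, mounted_files):
    """Model the directory after mounting: same-named files are replaced."""
    merged = dict(image_files)
    merged.update(mounted_files)
    return merged


def duplicate_repo_ids(files):
    """Map each repo ID declared more than once to the files declaring it."""
    seen = defaultdict(list)
    for filename, text in files.items():
        for repo_id in repo_ids(text):
            seen[repo_id].append(filename)
    return {rid: sorted(names) for rid, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical image contents, for illustration only.
    image = {
        "CentOS-Base.repo": "[base]\nname=a\n[updates]\nname=b\n",
        "cuda.repo": "[cuda]\nname=cuda\n",
    }
    # A differently named file adds a second definition of each repo ID.
    mounted = {"local.repo": "[base]\nname=x\n[cuda]\nname=y\n"}
    print(duplicate_repo_ids(effective_files(image, mounted)))
```

An empty result means no repo ID is declared twice, which is the state that reusing the original file names is meant to reach.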

Feb 03 '22 00:02 cdesiniotis

I am using HTTP for now, and yes, cert-config isn't functioning in the CentOS pod.

Feb 08 '22 18:02 yug0slav

> Hi @yug0slav
>
> Can you try naming your repo configuration files CentOS-Vault.repo and cuda.repo and retry this? Naming them as such will replace the existing repo configuration files in the driver image at /etc/yum.repos.d/, which will avoid this error.
>
> apt will raise a warning for repos it cannot reach, but it will not fail. yum appears to behave differently.

Hi @cdesiniotis,

apt seems to only raise a warning, but the nvidia-driver-daemonset pod still seems to need packages from the official repos to work.

Using the official vGPU driver image from NVIDIA AI Enterprise, nvcr.io/nvaie/vgpu-guest-driver-2-0:510.47.03-ubuntu20.04, in an air-gapped environment:

DRIVER_ARCH is x86_64
found 1 vgpu devices on host
vgpu driver version selected: 510.47.03-grid
Creating directory NVIDIA-Linux-x86_64-510.47.03-grid
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.47.03.........................................................................                                                                                         ....................................................................................................................................................                                                                                         ....................................................................................................................................................                                                                                         ....................................................................................................................................................                                                                                         .........................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cach                                                                                         e, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver instal                                                                                         lation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kerne                                                                                         l module matching this driver version is installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 510.47.03-grid for Linux kernel version 5.4.0-97-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease  Temporary failure resolving 'developer.download.nvidia.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

If we rebuild the image to include our local repositories, the driver pod works:

DRIVER_ARCH is x86_64
found 1 vgpu devices on host
vgpu driver version selected: 510.47.03-grid
Creating directory NVIDIA-Linux-x86_64-510.47.03-grid
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.47.03..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 510.47.03-grid for Linux kernel version 5.4.0-97-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-97-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
/usr/src/nvidia-510.47.03-grid/kernel/nvidia/nv-dma.c:986: warning: "IMPORT_SGT_STUBS_NEEDED" redefined
  986 | #define IMPORT_SGT_STUBS_NEEDED 0
      |
/usr/src/nvidia-510.47.03-grid/kernel/nvidia/nv-dma.c:980: note: this is the location of the previous definition
  980 | #define IMPORT_SGT_STUBS_NEEDED 1
      |
/usr/src/nvidia-510.47.03-grid/kernel/nvidia/nv-procfs.o: warning: objtool: .text.unlikely: unexpected end of section
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-uvm/uvm.c: In function 'uvm_mmap':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-uvm/uvm.c:887:1: warning: label 'out_va_space_unlock' defined but not used [-Wunused-label]
  887 | out_va_space_unlock:
      | ^~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'cursor_plane_req_config_update':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:81:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
   81 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |                                ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:80:27: warning: unused variable 'nv_dev' [-Wunused-variable]
   80 |     struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
      |                           ^~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'plane_req_config_update':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:182:9: warning: unused variable 'ret' [-Wunused-variable]
  182 |     int ret = 0;
      |         ^~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_plane_atomic_set_property':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:497:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
  497 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |                                ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_enumerate_crtcs_and_planes':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:1141:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
 1141 |             struct drm_plane *overlay_plane =
      |             ^~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable 'overlay_event' [-Wunused-variable]
   98 |     bool overlay_event = false;
      |          ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable 'primary_event' [-Wunused-variable]
   97 |     bool primary_event = false;
      |          ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable]
   96 |     struct drm_plane *primary_plane = crtc->primary;
      |                       ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  445 |     int status = 0;
      |     ^~~
Relinking NVIDIA driver kernel modules...
Building NVIDIA driver package nvidia-modules-5.4.0-97...
Installing NVIDIA driver kernel modules...

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.


ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)


WARNING: Ignoring CC version mismatch:

The kernel was built with gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), but the current compiler version is cc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0.


Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 8 CPUs online; setting concurrency level to 8.
Installing NVIDIA driver version 510.47.03.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.4.0-97-generic/build'

Kernel output path: '/lib/modules/5.4.0-97-generic/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.

The CC version check failed:

The kernel was built with gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), but the current compiler version is cc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0.

This may lead to subtle problems; if you are not certain whether the mismatched compiler will be compatible with your kernel, you may wish to abort installation, set the CC environment variable to the name of the compiler used to compile your kernel, and restart installation.
Valid responses are:
 (1)    "Ignore CC version check" [ default ]
 (2)    "Abort installation"
Please select your response by number or name:
The CC version check failed:

The kernel was built with gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), but the current compiler version is cc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0.

This may lead to subtle problems; if you are not certain whether the mismatched compiler will be compatible with your kernel, you may wish to abort installation, set the CC environment variable to the name of the compiler used to compile your kernel, and restart installation. (Answer: Ignore CC version check)
Cleaning kernel module build directory.
Building kernel modules
  : [##############################] 100%
Kernel module compilation complete.
Kernel messages:
[ 6698.314708] device nvidia-d-c7695b left promiscuous mode
[ 6702.389383] nvidia-modeset: Unloading
[ 6702.475656] nvidia-uvm: Unloaded the UVM driver.
[ 6702.496647] nvidia-nvlink: Unregistered the Nvlink Core, major device number 240
[ 6704.411846] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-c-b113b7: link becomes ready
[ 6704.411884] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 6704.415240] device nvidia-c-b113b7 entered promiscuous mode
[ 7383.185541] device gpu-oper-8a3d39 left promiscuous mode
[ 7388.199151] device nvidia-d-3fc71d left promiscuous mode
[ 7388.249135] device nvidia-c-b113b7 left promiscuous mode
[ 7395.059573] IPv6: ADDRCONF(NETDEV_CHANGE): gpu-oper-604275: link becomes ready
[ 7395.059605] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7395.061026] device gpu-oper-604275 entered promiscuous mode
[ 7414.366835] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-c-3ef47a: link becomes ready
[ 7414.366887] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7414.369379] device nvidia-c-3ef47a entered promiscuous mode
[ 7414.953808] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7414.955892] device nvidia-d-d35927 entered promiscuous mode
[ 7515.170455] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 7515.172870] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.47.03  Mon Jan 24 22:58:54 UTC 2022
[ 7515.178556] nvidia-uvm: Loaded the UVM driver, major device number 237.
[ 7515.180335] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  510.47.03  Mon Jan 24 22:51:43 UTC 2022
[ 7515.184090] nvidia-modeset: Unloading
[ 7515.302270] nvidia-uvm: Unloaded the UVM driver.
[ 7515.329080] nvidia-nvlink: Unregistered the Nvlink Core, major device number 239
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (510.47.03):
  Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
  Checking: [##############################] 100%
Post-install sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 510.47.03) is now complete.

Parsing kernel module parameters...
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
Copying gridd.conf...
Copying ClientConfigToken...
Starting nvidia-gridd..
Starting nvidia-topologyd..
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Done, now waiting for signal

The same happens with the publicly available driver image nvcr.io/nvidia/driver:510.47.03-ubuntu20.04:

DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-510.47.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.47.03..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 510.47.03 for Linux kernel version 5.4.0-97-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease  Temporary failure resolving 'developer.download.nvidia.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-97-generic
Installing Linux kernel headers...
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-97_5.4.0-97.110_all.deb  Temporary failure resolving 'archive.ubuntu.com'
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-97-generic_5.4.0-97.110_amd64.deb  Temporary failure resolving 'archive.ubuntu.com'
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

Apr 20 '22 09:04 chrisholzheimer

@chrisholzheimer

> apt seems to only raise a warning, but the nvidia-driver-daemonset pod still seems to need packages from the official repos to work.

Yes, these packages are required for the driver container to work.

> If we rebuild the image to include our local repositories, the driver pod works:

Instead of rebuilding the image, you can follow these instructions for mounting a custom repo config file into the driver daemonset so that your local repository mirror is used.

Apr 20 '22 16:04 cdesiniotis

Thank you @cdesiniotis, I did not notice this in the documentation. It seems to be something really new. In any case, the repo config file works like a charm!

Additionally, we need to add our CA trust chain. I tried it as described there, but the certificates are not added to the container's certificate store. I could verify that our CA files are inside the /etc/ssl/certs directory, but they are not added to the ca-certificates.crt file. I had to copy the CA files manually to /usr/local/share/ca-certificates and run update-ca-certificates to get them added. It seems that just mounting the files into the /etc/ssl/certs directory is not sufficient.
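
The check described above (CA files present in /etc/ssl/certs but missing from ca-certificates.crt) can be scripted. This is a rough sketch, not part of the operator or the driver image: the function names and the CA file name `my-ca.pem` are hypothetical. It compares the base64 bodies of PEM certificate blocks and does not validate the certificates.

```python
import re

# Captures the base64 body between the PEM certificate markers.
PEM_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def pem_bodies(text):
    """Return the set of certificate bodies in a PEM text, whitespace removed."""
    return {"".join(body.split()) for body in PEM_RE.findall(text)}


def missing_from_bundle(ca_pem_text, bundle_text):
    """Return bodies of certificates in ca_pem_text that the bundle lacks."""
    return pem_bodies(ca_pem_text) - pem_bodies(bundle_text)


if __name__ == "__main__":
    # Hypothetical CA file name; the bundle path is the one named in the post.
    with open("/etc/ssl/certs/my-ca.pem") as f:
        ca_text = f.read()
    with open("/etc/ssl/certs/ca-certificates.crt") as f:
        bundle_text = f.read()
    missing = missing_from_bundle(ca_text, bundle_text)
    print(f"{len(missing)} certificate(s) not in the bundle")
```

A non-zero count after mounting the files would match the behaviour described here, where the bundle only picked up the CA after update-ca-certificates was run.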

Apr 21 '22 11:04 chrisholzheimer

@chrisholzheimer this has been fixed in v23.3.0. Please try the latest driver images and re-open the issue if you still see it.

Apr 05 '23 06:04 shivamerla
