gpu-operator
1.9.0 chart issue with repo-config in air-gapped env
- There is an issue when configuring a local repo ConfigMap for an air-gapped environment with the 1.9.0 operator chart: the default CentOS and CUDA repos are still configured and used in /etc/yum.repos.d/ in the nvidia-driver-daemonset pod.
...
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Repository base is listed more than once in the configuration
Repository updates is listed more than once in the configuration
Repository extras is listed more than once in the configuration
Repository centosplus is listed more than once in the configuration
Repository cuda is listed more than once in the configuration
Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=container error was
14: curl#7 - "Failed to connect to 2001:4178:5:200::10: Network is unreachable"
...
Cannot find a valid baseurl for repo: base/7/x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Hi @yug0slav
Can you try naming your repo configuration files CentOS-Vault.repo and cuda.repo and retry this? Naming them as such will replace the existing repo configuration files in the driver image at /etc/yum.repos.d/, which will avoid this error.
apt will raise a warning for repos it cannot reach, but it will not fail. yum appears to behave differently.
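The "listed more than once" messages in the log come from yum finding the same repo id in more than one file under /etc/yum.repos.d/. A minimal sketch for spotting such collisions inside the driver container, assuming a POSIX sh with sed, sort and uniq available; the function name duplicate_repo_ids is made up for illustration:

```shell
#!/bin/sh
# List repository ids ([section] headers) that are defined more than once
# across the *.repo files in a directory (default: /etc/yum.repos.d).
duplicate_repo_ids() {
    repo_dir="${1:-/etc/yum.repos.d}"
    # Extract every [id] header, strip the brackets, keep ids seen repeatedly.
    cat "$repo_dir"/*.repo 2>/dev/null \
        | sed -n 's/^\[\([^]]*\)].*/\1/p' \
        | sort | uniq -d
}
```

Any id this prints is defined in more than one place, which would suggest that a mounted file was added next to the image's default file instead of replacing it.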
- The cert-config ConfigMap described in the air-gapped docs isn't working...
  - It looks like certs are being dropped into /etc/pki/ca-trust/extracted/pem/ but not actually installed.
  - Drop them into /etc/pki/ca-trust/sources/anchors/ and run update-ca-trust instead???
- Switched to HTTP and changed repo-config to make it work:
apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator
data:
  CentOS-Base.repo: |
    [base]
    name=CentOS-7 - Base
    ...
  CentOS-Updates.repo: |
    [updates]
    name=CentOS-7 - Updates
    ...
  CentOS-Extras.repo: |
    [extras]
    name=CentOS-7 - Extras
    ...
  CentOS-Plus.repo: |
    [centosplus]
    name=CentOS-7 - Plus
    ...
  EPEL.repo: |
    [epel]
    name=CentOS-7 - EPEL
    ...
  cuda.repo: |
    [cuda]
    name=cuda
    ...
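Because each key under data becomes a file name in the driver pod, a ConfigMap like this can also be built from a directory of .repo files. A hedged sketch that only prints the kubectl command so it can be reviewed first; the helper name repo_configmap_cmd is hypothetical, and paths containing spaces are not handled:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a kubectl command that packs every
# *.repo file in a directory into the repo-config ConfigMap.
repo_configmap_cmd() {
    repo_dir="$1"
    namespace="${2:-gpu-operator}"
    cmd="kubectl create configmap repo-config -n $namespace"
    for f in "$repo_dir"/*.repo; do
        # Skip the literal pattern when the directory has no .repo files.
        [ -f "$f" ] || continue
        # Each file name becomes a key, so name files after the ones to replace.
        cmd="$cmd --from-file=$f"
    done
    printf '%s\n' "$cmd"
}
```

For example, `repo_configmap_cmd ./repos` prints one --from-file argument per .repo file found in ./repos (an assumed directory name).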
Thanks for more details. If I am understanding you correctly, there are two issues. Correct me if I am wrong.
1. On CentOS 7, you have to name your repo config files exactly the same as the original ones you are trying to replace (e.g. CentOS-Base.repo, cuda.repo). yum will complain about duplicate repo entries otherwise.
2. Mounting custom keys/certificates through the cert-config ConfigMap is not functional.
For 1, we will improve our documentation.
For 2, we will need to bring up a proper "air-gapped" environment for CentOS 7 and get back to you. Are you able to use HTTP for now and successfully install GPU Operator 1.9 in your air-gapped environment?
I am using HTTP for now... and yes cert-config isn't functioning in centos pod.
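The workaround floated earlier in the thread (copy the certificate into /etc/pki/ca-trust/sources/anchors/ and run update-ca-trust) could be sketched as follows. Nobody in this thread confirmed it inside the driver container, so treat it as an untested idea; the function name and the ANCHORS_DIR and UPDATE_CMD overrides are hypothetical and exist only to allow a dry run:

```shell
#!/bin/sh
# Sketch of the suggested CentOS workaround; not confirmed in this thread.
# ANCHORS_DIR and UPDATE_CMD are hypothetical overrides for dry runs.
install_ca_anchor() {
    cert="$1"
    anchors_dir="${ANCHORS_DIR:-/etc/pki/ca-trust/sources/anchors}"
    update_cmd="${UPDATE_CMD:-update-ca-trust}"

    [ -f "$cert" ] || { echo "no such certificate: $cert" >&2; return 1; }
    # Certificates placed here are read when the trust store is rebuilt.
    cp "$cert" "$anchors_dir/" || return 1
    # Regenerate the consolidated bundles under /etc/pki/ca-trust/extracted/.
    "$update_cmd" extract
}
```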
Hi @cdesiniotis,
apt seems to only raise a warning, but it seems the nvidia-driver-daemonset pod still needs packages from the official repos to work.
Using the official vGPU driver image from AI Enterprise, nvcr.io/nvaie/vgpu-guest-driver-2-0:510.47.03-ubuntu20.04, in an air-gapped environment:
DRIVER_ARCH is x86_64
found 1 vgpu devices on host
vgpu driver version selected: 510.47.03-grid
Creating directory NVIDIA-Linux-x86_64-510.47.03-grid
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.47.03......................................................................... .................................................................................................................................................... .................................................................................................................................................... .................................................................................................................................................... .........................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 510.47.03-grid for Linux kernel version 5.4.0-97-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease Temporary failure resolving 'developer.download.nvidia.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
If we rebuild the image to include our local repositories, the driver pod works:
DRIVER_ARCH is x86_64
found 1 vgpu devices on host
vgpu driver version selected: 510.47.03-grid
Creating directory NVIDIA-Linux-x86_64-510.47.03-grid
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.47.03..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 510.47.03-grid for Linux kernel version 5.4.0-97-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-97-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
/usr/src/nvidia-510.47.03-grid/kernel/nvidia/nv-dma.c:986: warning: "IMPORT_SGT_STUBS_NEEDED" redefined
986 | #define IMPORT_SGT_STUBS_NEEDED 0
|
/usr/src/nvidia-510.47.03-grid/kernel/nvidia/nv-dma.c:980: note: this is the location of the previous definition
980 | #define IMPORT_SGT_STUBS_NEEDED 1
|
/usr/src/nvidia-510.47.03-grid/kernel/nvidia/nv-procfs.o: warning: objtool: .text.unlikely: unexpected end of section
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-uvm/uvm.c: In function 'uvm_mmap':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-uvm/uvm.c:887:1: warning: label 'out_va_space_unlock' defined but not used [-Wunused-label]
887 | out_va_space_unlock:
| ^~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'cursor_plane_req_config_update':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:81:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
81 | struct nv_drm_plane_state *nv_drm_plane_state =
| ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:80:27: warning: unused variable 'nv_dev' [-Wunused-variable]
80 | struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
| ^~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'plane_req_config_update':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:182:9: warning: unused variable 'ret' [-Wunused-variable]
182 | int ret = 0;
| ^~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_plane_atomic_set_property':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:497:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
497 | struct nv_drm_plane_state *nv_drm_plane_state =
| ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_enumerate_crtcs_and_planes':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-crtc.c:1141:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
1141 | struct drm_plane *overlay_plane =
| ^~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable 'overlay_event' [-Wunused-variable]
98 | bool overlay_event = false;
| ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable 'primary_event' [-Wunused-variable]
97 | bool primary_event = false;
| ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable]
96 | struct drm_plane *primary_plane = crtc->primary;
| ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-510.47.03-grid/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
445 | int status = 0;
| ^~~
Relinking NVIDIA driver kernel modules...
Building NVIDIA driver package nvidia-modules-5.4.0-97...
Installing NVIDIA driver kernel modules...
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)
WARNING: Ignoring CC version mismatch:
The kernel was built with gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), but the current compiler version is cc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0.
Welcome to the NVIDIA Software Installer for Unix/Linux
Detected 8 CPUs online; setting concurrency level to 8.
Installing NVIDIA driver version 510.47.03.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.4.0-97-generic/build'
Kernel output path: '/lib/modules/5.4.0-97-generic/build'
Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
The CC version check failed:
The kernel was built with gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), but the current compiler version is cc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0.
This may lead to subtle problems; if you are not certain whether the mismatched compiler will be compatible with your kernel, you may wish to abort installation, set the CC environment variable to the name of the compiler used to compile your kernel, and restart installation.
Valid responses are:
(1) "Ignore CC version check" [ default ]
(2) "Abort installation"
Please select your response by number or name:
The CC version check failed:
The kernel was built with gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), but the current compiler version is cc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0.
This may lead to subtle problems; if you are not certain whether the mismatched compiler will be compatible with your kernel, you may wish to abort installation, set the CC environment variable to the name of the compiler used to compile your kernel, and restart installation. (Answer: Ignore CC version check)
Cleaning kernel module build directory.
Building kernel modules
: [##############################] 100%
Kernel module compilation complete.
Kernel messages:
[ 6698.314708] device nvidia-d-c7695b left promiscuous mode
[ 6702.389383] nvidia-modeset: Unloading
[ 6702.475656] nvidia-uvm: Unloaded the UVM driver.
[ 6702.496647] nvidia-nvlink: Unregistered the Nvlink Core, major device number 240
[ 6704.411846] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-c-b113b7: link becomes ready
[ 6704.411884] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 6704.415240] device nvidia-c-b113b7 entered promiscuous mode
[ 7383.185541] device gpu-oper-8a3d39 left promiscuous mode
[ 7388.199151] device nvidia-d-3fc71d left promiscuous mode
[ 7388.249135] device nvidia-c-b113b7 left promiscuous mode
[ 7395.059573] IPv6: ADDRCONF(NETDEV_CHANGE): gpu-oper-604275: link becomes ready
[ 7395.059605] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7395.061026] device gpu-oper-604275 entered promiscuous mode
[ 7414.366835] IPv6: ADDRCONF(NETDEV_CHANGE): nvidia-c-3ef47a: link becomes ready
[ 7414.366887] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7414.369379] device nvidia-c-3ef47a entered promiscuous mode
[ 7414.953808] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7414.955892] device nvidia-d-d35927 entered promiscuous mode
[ 7515.170455] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 7515.172870] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.47.03 Mon Jan 24 22:58:54 UTC 2022
[ 7515.178556] nvidia-uvm: Loaded the UVM driver, major device number 237.
[ 7515.180335] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 510.47.03 Mon Jan 24 22:51:43 UTC 2022
[ 7515.184090] nvidia-modeset: Unloading
[ 7515.302270] nvidia-uvm: Unloaded the UVM driver.
[ 7515.329080] nvidia-nvlink: Unregistered the Nvlink Core, major device number 239
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (510.47.03):
Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
Checking: [##############################] 100%
Post-install sanity check passed.
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 510.47.03) is now complete.
Parsing kernel module parameters...
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
Copying gridd.conf...
Copying ClientConfigToken...
Starting nvidia-gridd..
Starting nvidia-topologyd..
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
The same happens with the publicly available driver image nvcr.io/nvidia/driver:510.47.03-ubuntu20.04:
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-510.47.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.47.03..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 510.47.03 for Linux kernel version 5.4.0-97-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease Temporary failure resolving 'developer.download.nvidia.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-97-generic
Installing Linux kernel headers...
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-97_5.4.0-97.110_all.deb Temporary failure resolving 'archive.ubuntu.com'
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-97-generic_5.4.0-97.110_amd64.deb Temporary failure resolving 'archive.ubuntu.com'
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
@chrisholzheimer
apt seems to only raise a warning, but it seems the nvidia-driver-daemonset pod still needs packages from the official repos to work:
Yes, these packages are required for the driver container to work.
If we rebuild the image to include our local repositories, the driver pod works:
Instead of rebuilding the image, you can follow these instructions for mounting a custom repo config file into the driver daemonset so that your local repository mirror is used.
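For the Ubuntu 20.04 image in the logs above, such a repo config file is an apt sources list pointing at the local mirror. A minimal sketch that generates one; the helper name write_local_sources, the mirror URL and the main/universe component list are assumptions, so adjust them to whatever your mirror actually serves:

```shell
#!/bin/sh
# Hypothetical helper: write an apt sources list for a local Ubuntu mirror.
write_local_sources() {
    mirror="$1"            # assumption: base URL of your local mirror
    out="$2"               # file to write, e.g. custom-repo.list
    release="${3:-focal}"  # Ubuntu 20.04, as used by the images in this thread
    for suite in "$release" "$release-updates" "$release-security"; do
        printf 'deb [arch=amd64] %s %s main universe\n' "$mirror" "$suite"
    done > "$out"
}

# Example with a made-up mirror host:
# write_local_sources http://mirror.example.internal/ubuntu custom-repo.list
```

The resulting file can then be loaded into the repo-config ConfigMap with `kubectl create configmap repo-config -n gpu-operator --from-file=custom-repo.list`. To my knowledge the chart value that references it is driver.repoConfig.configMapName, but check the documentation for your chart version.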
Thank you @cdesiniotis, I did not notice this in the documentation. That seems to be something really new. However, the repo config file works like a charm!
Additionally, we need to add our CA trust chain. I tried it as described there, but the certificates are not added to the certificate store of the container. I was able to verify that our CA files are inside the /etc/ssl/certs dir, but they are not added to the file ca-certificates.crt. I needed to cp the CA files manually to /usr/local/share/ca-certificates and execute update-ca-certificates to get them added. It seems to me that it is not sufficient to just mount the files into the /etc/ssl/certs directory.
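The manual steps described above can be sketched as a small function. It assumes an Ubuntu-based driver container; the function name and the CA_DEST_DIR and UPDATE_CMD overrides are hypothetical and exist only so the steps can be tried without touching the real trust store:

```shell
#!/bin/sh
# Sketch of the manual workaround: copy CA files into the local CA directory
# and rebuild the bundle. CA_DEST_DIR and UPDATE_CMD are hypothetical overrides.
install_ca_ubuntu() {
    dest_dir="${CA_DEST_DIR:-/usr/local/share/ca-certificates}"
    update_cmd="${UPDATE_CMD:-update-ca-certificates}"
    for cert in "$@"; do
        [ -f "$cert" ] || { echo "no such certificate: $cert" >&2; return 1; }
        name=$(basename "$cert")
        # update-ca-certificates only includes files with a .crt extension.
        cp "$cert" "$dest_dir/${name%.*}.crt" || return 1
    done
    # Rebuilds /etc/ssl/certs/ca-certificates.crt from the installed certificates.
    "$update_cmd"
}
```

Called as `install_ca_ubuntu /etc/ssl/certs/my-ca.pem` (an assumed file name), it would copy the file to /usr/local/share/ca-certificates/my-ca.crt before rebuilding the bundle.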
@chrisholzheimer this has been fixed with v23.3.0. Please try out with latest driver images and re-open if you still see this issue.