
Latest CRI-O (on 1.25/1.26) failing to install gpu-operator

Open KodieGlosserIBM opened this issue 11 months ago • 17 comments

tl;dr at the bottom

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL8.9
  • Kernel Version: 4.18.0-513.18.1.el8_9.x86_64
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): cri-o (versions 1.25 and 1.26)
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP (versions 4.12 and 4.13)
  • GPU Operator Version: 23.9.2 (latest)

2. Issue or feature description

After pulling in the most recent CRI-O changes on OCP 4.12/4.13 (https://github.com/cri-o/cri-o/compare/1b1a520...8724c4d), specifically cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8 and cri-o-1.26.5-7.rhaos4.13.git692ef91.el8, the GPU driver installer fails to install elfutils:

Installing elfutils...
+ echo 'Installing elfutils...'
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
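
For context: the driver container relies on CRI-O injecting the host's RHEL entitlement (the subscription entries from /usr/share/containers/mounts.conf) so that dnf inside the container can resolve elfutils-libelf-devel. A quick host-side sanity check on the affected node (the paths below are the usual ones for an entitled RHEL 8 worker; adjust to your environment):

# run on the node, e.g. via `oc debug node/<node>` followed by `chroot /host`
cat /usr/share/containers/mounts.conf
ls -l /etc/pki/entitlement* 2>/dev/null

If the entitlement data is present on the node, the failure is on the runtime side, which is what the CRI-O journal entries further down point at.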

3. Steps to reproduce the issue

Use container runtime CRI-O at version cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8 (OCP 4.12) or cri-o-1.26.5-7.rhaos4.13.git692ef91.el8 (OCP 4.13).

4. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
k get pods -n gpu-operator-resources -o wide   
NAME                                       READY   STATUS             RESTARTS        AGE     IP               NODE          NOMINATED NODE   READINESS GATES
gpu-feature-discovery-9qq4g                0/1     Init:0/1           0               3h28m   172.17.162.202   10.180.8.40   <none>           <none>
gpu-operator-8b54f655-45f6k                1/1     Running            0               3h46m   172.17.162.253   10.180.8.40   <none>           <none>
nfd-controller-manager-5988c689d-ddg4q     2/2     Running            0               4h      172.17.162.248   10.180.8.40   <none>           <none>
nfd-master-966d4c54c-l7mv4                 1/1     Running            0               3h47m   172.17.162.251   10.180.8.40   <none>           <none>
nfd-worker-48tnv                           1/1     Running            0               3h47m   10.180.8.39      10.180.8.39   <none>           <none>
nfd-worker-8n856                           1/1     Running            1 (3h47m ago)   3h47m   10.180.8.40      10.180.8.40   <none>           <none>
nfd-worker-jqssq                           1/1     Running            0               3h47m   10.180.8.38      10.180.8.38   <none>           <none>
nvidia-container-toolkit-daemonset-n7k8z   0/1     Init:0/1           0               3h28m   172.17.162.230   10.180.8.40   <none>           <none>
nvidia-dcgm-exporter-hmg7m                 0/1     Init:0/1           0               3h28m   10.180.8.40      10.180.8.40   <none>           <none>
nvidia-dcgm-fnz2j                          0/1     Init:0/1           0               3h28m   10.180.8.40      10.180.8.40   <none>           <none>
nvidia-device-plugin-daemonset-zcbbl       0/1     Init:0/1           0               3h28m   172.17.162.254   10.180.8.40   <none>           <none>
nvidia-driver-daemonset-c9zfb              0/1     CrashLoopBackOff   43 (5m ago)     3h29m   172.17.162.220   10.180.8.40   <none>           <none>
nvidia-node-status-exporter-b6gxl          1/1     Running            0               3h43m   172.17.162.198   10.180.8.40   <none>           <none>
nvidia-operator-validator-zr6sb            0/1     Init:0/4           0               3h28m   172.17.162.252   10.180.8.40   <none>           <none>
  • [x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
k get ds -n gpu-operator-resources                               
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   3h43m
nfd-worker                           3         3         3       3            3           <none>                                             3h48m
nvidia-container-toolkit-daemonset   1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       3h43m
nvidia-dcgm                          1         1         0       1            0           nvidia.com/gpu.deploy.dcgm=true                    3h43m
nvidia-dcgm-exporter                 1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           3h43m
nvidia-device-plugin-daemonset       1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           3h43m
nvidia-driver-daemonset              1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  3h43m
nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             3h43m
nvidia-node-status-exporter          1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true    3h43m
nvidia-operator-validator            1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      3h43m
  • [x] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
k describe pod -n gpu-operator-resources nvidia-driver-daemonset-c9zfb 
Name:                 nvidia-driver-daemonset-c9zfb
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 10.180.8.40/10.180.8.40
Start Time:           Mon, 11 Mar 2024 16:59:01 -0500
Labels:               app=nvidia-driver-daemonset
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=dc74cc498
                      nvidia.com/precompiled=false
                      pod-template-generation=3
Annotations:          cni.projectcalico.org/containerID: bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6
                      cni.projectcalico.org/podIP: 172.17.162.220/32
                      cni.projectcalico.org/podIPs: 172.17.162.220/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "172.17.162.220"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   172.17.162.220
IPs:
  IP:           172.17.162.220
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  cri-o://51588c2c91637fbdaaa68b22e2b9100199a5b8e3afa0b98ea471f1acdc64a716
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 11 Mar 2024 16:59:03 -0500
      Finished:     Mon, 11 Mar 2024 16:59:37 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator-resources (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9xwl (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4
    Image:         nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3
    Image ID:      nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Mar 2024 20:28:37 -0500
      Finished:     Mon, 11 Mar 2024 20:28:49 -0500
    Ready:          False
    Restart Count:  44
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /lib/firmware from nv-firmware (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /sys/devices/system/memory/auto_online_blocks from sysfs-memory-online (rw)
      /sys/module/firmware_class/parameters/path from firmware-search-path (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9xwl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  firmware-search-path:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/module/firmware_class/parameters/path
    HostPathType:  
  sysfs-memory-online:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/devices/system/memory/auto_online_blocks
    HostPathType:  
  nv-firmware:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver/lib/firmware
    HostPathType:  DirectoryOrCreate
  kube-api-access-m9xwl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   80m (x29 over 3h29m)   kubelet  Container image "nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3" already present on machine
  Warning  BackOff  29s (x947 over 3h29m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-c9zfb_gpu-operator-resources(8a0d8d4f-9f88-4c42-930d-508ab7653a98)
  • [x] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
k logs -n gpu-operator-resources nvidia-driver-daemonset-c9zfb -c nvidia-driver-ctr -p
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=550.54.14
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
DRIVER_ARCH is x86_64
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ RHEL_VERSION=
+ RHEL_MAJOR_VERSION=8
+ OPEN_KERNEL_MODULES_ENABLED=false
+ [[ false == \t\r\u\e ]]
+ KERNEL_TYPE=kernel
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
++ GDRCOPY_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-513.18.1.el8_9.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ _get_rhel_version_from_kernel
+ local rhel_version_underscore rhel_version_arr
++ echo 4.18.0-513.18.1.el8_9.x86_64
++ sed 's/.*el\([0-9]\+_[0-9]\+\).*/\1/g'
+ rhel_version_underscore=8_9
+ [[ ! 8_9 =~ ^[0-9]+_[0-9]+$ ]]
+ IFS=_
+ read -r -a rhel_version_arr
+ [[ 2 -ne 2 ]]
+ RHEL_VERSION=8.9
+ echo 'RHEL VERSION successfully resolved from kernel: 8.9'
RHEL VERSION successfully resolved from kernel: 8.9
+ return 0
+ [[ -z '' ]]
+ DNF_RELEASEVER=8.9
+ return 0
+ init
+ _prepare_exclusive
+ _prepare
+ '[' passthrough = vgpu ']'
+ sh NVIDIA-Linux-x86_64-550.54.14.run -x
Creating directory NVIDIA-Linux-x86_64-550.54.14
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.14........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
+ cd NVIDIA-Linux-x86_64-550.54.14
+ sh /tmp/install.sh nvinstall
DRIVER_ARCH is x86_64

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.


WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

+ mkdir -p /usr/src/nvidia-550.54.14
+ mv LICENSE mkprecompiled kernel /usr/src/nvidia-550.54.14
+ sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest

========== NVIDIA Software Installer ==========

+ echo -e '\n========== NVIDIA Software Installer ==========\n'
+ echo -e 'Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64\n'
Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64

+ exec
+ flock -n 3
+ echo 332725
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _build
+ _kernel_requires_package
+ local proc_mount_arg=
Checking NVIDIA driver packages...
+ echo 'Checking NVIDIA driver packages...'
+ [[ ! -d /usr/src/nvidia-550.54.14/kernel ]]
+ cd /usr/src/nvidia-550.54.14/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-513.18.1.el8_9.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
Updating the package cache...
+ echo 'Updating the package cache...'
+ yum -q makecache
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.2PbAo42Ahy
+ trap 'rm -rf /tmp/tmp.2PbAo42Ahy' EXIT
+ cd /tmp/tmp.2PbAo42Ahy
+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.2PbAo42Ahy
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0
  • [x] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • [x] container runtime logs (CRI-O in this case): journalctl -u crio > crio.log
Mar 11 20:28:36 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:36.897237180-05:00" level=warning msg="Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem on host \"/var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets\": write subscription data: write file: open /var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets/etc-pki-entitlement/6292044582955687386-key.pem: no such file or directory"
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.017844642-05:00" level=info msg="Created container ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4: gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr" id=845fbe19-ee47-4a2a-813f-d0bd23f6ba6c name=/runtime.v1.RuntimeService/CreateContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.018493833-05:00" level=info msg="Starting container: ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4" id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.025441319-05:00" level=info msg="Started container" PID=332725 containerID=ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4 description=gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer sandboxID=bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

tl;dr

It looks like the culprit is specifically the switch from the os.WriteFile function to the umask.WriteFileIgnoreUmask function on this line: https://github.com/cri-o/cri-o/pull/7774/files#diff-23e01fcec1708a4fa51b3f495b7c7f075070b0a9c5a9195f349efee6d9444d4dR271

CRI-O fails to mount the subscription into the container, as seen in these logs (more above):

Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem
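
If you want to confirm you are hitting the same regression, two quick checks on an affected node (assuming the standard OCP setup where the runtime unit is named crio; adjust if yours differs):

# CRI-O build on the node; the regression showed up with the 8724c4d / 692ef91 builds
crictl version
# the tell-tale warning logged when the driver container is created
journalctl -u crio | grep -i 'Failed to mount subscriptions'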

KodieGlosserIBM avatar Mar 12 '24 01:03 KodieGlosserIBM

Emailed the must-gather bundle to [email protected]

KodieGlosserIBM avatar Mar 12 '24 01:03 KodieGlosserIBM

I think this problem is RHEL or OpenShift specific. I have K8s 1.25.5 running on CRI-O 1.25.1 (runc) on Rocky Linux 8.7, and the GPU operator runs without issues there.

Zveroloff avatar Mar 13 '24 08:03 Zveroloff

@Zveroloff have you tried upgrading cri-o to cri-o-1.25.5-10? This is something we only started seeing recently, after this last version bump.

KodieGlosserIBM avatar Mar 13 '24 14:03 KodieGlosserIBM

@fabiendupont can you help address this issue in CRI-O, which is causing subscription mounts to fail?

shivamerla avatar Mar 21 '24 09:03 shivamerla

Hello everyone!

The work on the CRI-O side (via https://github.com/cri-o/cri-o/issues/7880) has already been completed.

There should be no more issues with CRI-O 1.25 and 1.26 (newer releases of CRI-O were not affected) that would prevent this operator from being run.
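
(If you want to double-check that a given node has actually picked up a fixed build, the container runtime column of a wide node listing is enough; something along these lines, where the last field is the CONTAINER-RUNTIME reported by the kubelet:

oc get nodes -o wide | awk '{print $1, $NF}'

Any node still reporting one of the affected 1.25/1.26 builds would need the updated packages.)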

kwilczynski avatar Apr 01 '24 05:04 kwilczynski

Thanks for the update @kwilczynski

shivamerla avatar Apr 01 '24 17:04 shivamerla

@francisguillier, your issue appears to be unrelated to the problem we have here.

Hopefully, you were able to resolve it.

kwilczynski avatar Apr 02 '24 12:04 kwilczynski

@kwilczynski - I saw that the fix was backported in the 4.12.54 RHSA.

I've updated a cluster to this newer version, but the issue is still present:

Worker Info:

cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     Ready,SchedulingDisabled   canary,worker   27h     v1.25.16+9946c63   169.60.156.4    <none>        Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa)   4.18.0-372.98.1.el8_6.x86_64   cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8

Nvidia Pods on Worker:

╰$ oc get pods -o wide -A | grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com | grep nvidia
nvidia-gpu-operator                                gpu-feature-discovery-rcdcf                                             0/1     Init:0/1                     0                  162m    10.130.2.16     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-container-toolkit-daemonset-lxpm7                                0/1     Init:0/1                     0                  162m    10.130.2.17     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-dcgm-exporter-n227d                                              0/1     Init:0/1                     0                  162m    169.60.156.4    cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-dcgm-kr6v8                                                       0/1     Init:0/1                     0                  162m    169.60.156.4    cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-device-plugin-daemonset-stgwm                                    0/1     Init:0/1                     0                  162m    10.130.2.19     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-driver-daemonset-kz5th                                           0/1     CrashLoopBackOff             294 (53s ago)      26h     10.130.2.2      cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-node-status-exporter-rcmdj                                       1/1     Running                      3                  28h     10.130.2.5      cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>
nvidia-gpu-operator                                nvidia-operator-validator-hpkvt                                         0/1     Init:0/4                     0                  162m    10.130.2.18     cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     <none>           <none>

The output of the failing pod shows the same error as before:

+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.AIojKsyUdp

jmkanz avatar Apr 04 '24 18:04 jmkanz

@jmkanz a couple of things:

  • a better forum may be (if possible) an OpenShift Jira ticket, as this forum is really more for upstream CRI-O, and these versions are out of upstream support
  • can you help me put together a more minimal reproducer? (A rough sketch of what I have in mind is below.) I attempted to install the NVIDIA operator and created a ClusterPolicy and NVIDIA driver instance, but I wonder if I did the right steps, as I'm getting different failures (and I doubt the cluster I installed on has GPUs to provision)
    • I also tried a ubi8 image and was able to install packages (elfutils was already present in the ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about missing entitlement certs (Found 0 entitlement certificates), but that's a different error than the one you are hitting
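
A rough sketch of the reproducer I have in mind (the node name is a placeholder; the point is just to exercise the subscription mount on an affected node without the NVIDIA operator in the picture):

# pin a throwaway UBI8 pod to the affected node and try the same dnf install
oc run entitlement-test --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi:latest \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<affected-node>"}}' \
  -- dnf install -y elfutils-libelf-devel.x86_64

If that reports the same 'Unable to find a match' error on a node running cri-o-1.25.5-10 / 1.26.5-7 but works on an older build, we have the reproducer.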

haircommander avatar Apr 05 '24 16:04 haircommander

@jmkanz can you post the status of all pods in the cluster please (especially CoreDNS).

cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com     Ready,SchedulingDisabled   canary,worker   27h     v1.25.16+9946c63   169.60.156.4    <none>        Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa)   4.18.0-372.98.1.el8_6.x86_64   cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8

The GPU Operator does seem to cordon the node in this case, so I'm wondering if any networking pods are being evicted, which would cause the driver install to fail.

shivamerla avatar Apr 05 '24 17:04 shivamerla

@shivamerla - I've manually cordoned this node since I updated it to the latest version of OpenShift 4.12.

The cordon should not impact the NVIDIA pods, as they run as DaemonSets. I've cordoned other nodes in the cluster as well (ones on an older version of CoreOS) and they run fine with or without the cordon.

Additionally, other pods are running fine on the node. Please see the output below (edited to sanitize IPs):

oc get pods -A -o wide |grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com
ibm-object-s3fs                                    ibmcloud-object-storage-driver-l925j                                    1/1     Running                      0                   126m    
ibm-observe                                        logdna-agent-8xssh                                                      1/1     Running                      3                   2d3h    
ibm-observe                                        sysdig-agent-n6v9s                                                      1/1     Running                      3                   2d3h    
jeg                                                kernel-image-puller-5bfpp                                               1/1     Running                      3                   2d3h    
kube-system                                        istio-cni-node-cl5sj                                                    1/1     Running                      3                   2d3h    
nvidia-gpu-operator                                gpu-feature-discovery-rcdcf                                             0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-container-toolkit-daemonset-lxpm7                                0/1     Init:0/1                     0                  
nvidia-gpu-operator                                nvidia-dcgm-exporter-n227d                                              0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-dcgm-kr6v8                                                       0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-device-plugin-daemonset-stgwm                                    0/1     Init:0/1                     0                   25h     
nvidia-gpu-operator                                nvidia-driver-daemonset-kz5th                                           0/1     CrashLoopBackOff             558 (14s ago)       2d1h    
nvidia-gpu-operator                                nvidia-node-status-exporter-rcmdj                                       1/1     Running                      3                   2d3h    
nvidia-gpu-operator                                nvidia-operator-validator-hpkvt                                         0/1     Init:0/4                     0                   25h    
openshift-cluster-node-tuning-operator             tuned-vnkmb                                                             1/1     Running                      1                   47h     
openshift-dns                                      dns-default-4vdm9                                                       2/2     Running                      2                   47h     
openshift-dns                                      node-resolver-l5vtp                                                     1/1     Running                      1                   47h     
openshift-image-registry                           node-ca-zmnhr                                                           1/1     Running                      1                   47h     
openshift-ingress-canary                           ingress-canary-xrdwz                                                    1/1     Running                      1                   47h     
openshift-machine-config-operator                  machine-config-daemon-bpj6x                                             2/2     Running                      2                   
openshift-monitoring                               node-exporter-djqxn                                                     2/2     Running                      2                   47h     
openshift-multus                                   multus-additional-cni-plugins-zhw2l                                     1/1     Running                      1                   47h     
openshift-multus                                   multus-ddwks                                                            1/1     Running                      1                   47h    
openshift-multus                                   network-metrics-daemon-cbj56                                            2/2     Running                      2                   47h     
openshift-network-diagnostics                      network-check-target-c6snb                                              1/1     Running                      1                   47h     
openshift-nfd                                      nfd-worker-pflkc                                                        1/1     Running                      3                   2d3h    
openshift-sdn                                      sdn-j7kng                                                               2/2     Running                      2                   47h     
openshift-storage                                  csi-cephfsplugin-fndsj                                                  2/2     Running                      6                   2d3h   
openshift-storage                                  csi-rbdplugin-5c6ml                                                     3/3     Running                      9                   2d3h    
tekton-pipelines                                   pwa-2r24x                                                               1/1     Running                      3                   2d3h    

DNS Pods for the cluster as well:

╰$ oc get pods -A |grep dns
openshift-dns-operator                             dns-operator-7f86f6f997-766l4                                           2/2     Running                      0                   47h
openshift-dns                                      dns-default-4n6hq                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-4vdm9                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-7v9rx                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-9wwps                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-bcv7s                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-bzsmp                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-csrpd                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-d677l                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-dv45x                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-j7xcv                                                       2/2     Running                      4                   47h
openshift-dns                                      dns-default-jb62l                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-lkq76                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-lpsfq                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-m6hr9                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-pf825                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-tj4bw                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-zjpsz                                                       2/2     Running                      2                   47h
openshift-dns                                      dns-default-zl52j                                                       2/2     Running                      0                   47h
openshift-dns                                      dns-default-zsgx8                                                       2/2     Running                      0                   47h
openshift-dns                                      node-resolver-2vsc8                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-48nkh                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-59vb8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-74btd                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-c5d4p                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-c5q44                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-clck8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-fjgnb                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-g54sd                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-gd6rk                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-gs4z2                                                     1/1     Running                      2                   47h
openshift-dns                                      node-resolver-l5n4z                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-l5vtp                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-l9kjc                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-rls8p                                                     1/1     Running                      1                   47h
openshift-dns                                      node-resolver-tr6wf                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-vlnj4                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-whzp6                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-wrnfs                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-xqwxs                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zhb5c                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zm4m8                                                     1/1     Running                      0                   47h
openshift-dns                                      node-resolver-zph8p                                                     1/1     Running                      0                   47h

jmkanz avatar Apr 05 '24 17:04 jmkanz

@jmkanz a couple of things:

  • a better forum may be (if possible) an openshift Jira ticket, as this forum is really more for upstream cri-o, and these versions are out of upstream support

  • can you help me put together a more minimal reproducer? I attempted to install the nvidia operator, and created a clusterpolicy and nvidia driver instance, but I wonder if I did the right steps as I'm getting different failures (and I doubt the cluster I installed has GPUs to provision)

    • I also tried to use a ubi8 image and I was able to install packages (elfutils was installed in ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about not having entitlement certs (Found 0 entitlement certificates), but that's a different one than you are hitting

Hey @haircommander - Thanks for your reply. We can move the conversation over to the other GitHub issue in the CRI-O repo if you prefer? This one is the NVIDIA repo. Additionally, if your cluster doesn't have GPUs, I doubt the install will even begin, since you need the correct labels from the NFD operator on GPU-enabled workers.

I believe @KodieGlosserIBM still has the issue open in Jira with Red Hat.

jmkanz avatar Apr 05 '24 18:04 jmkanz

Ah, I thought I was commenting there :upside_down_face:. This is fine too if it feels right.

Still wondering about a more minimal reproducer, potentially without the NVIDIA operator in the picture. Or, if you could help me get access to an environment where this is failing, that would work too.

haircommander avatar Apr 05 '24 18:04 haircommander

This seems to be resolved. I noticed this cluster had a ClusterPolicy that was not using the OCP Driver Toolkit, which is what prevents these entitlement issues.

Please see the links below for more information.

OpenShift Driver Toolkit info: https://docs.openshift.com/container-platform/4.12/hardware_enablement/psap-driver-toolkit.html

NVIDIA docs on installation with/without the Driver Toolkit: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html
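
For anyone else landing here, a quick way to check whether your ClusterPolicy actually enables the Driver Toolkit path (the policy name and field below are what I'd expect from recent operator releases, i.e. gpu-cluster-policy and spec.driver.useOpenShiftDriverToolkit; verify them against your CRD version before relying on this):

oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver.useOpenShiftDriverToolkit}{"\n"}'

When that is true, the driver build runs against the Driver Toolkit image and no RHEL entitlement is needed, which sidesteps the subscription-mount problem entirely.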

jmkanz avatar Apr 08 '24 16:04 jmkanz

@jmkanz and @KodieGlosserIBM, thank you for the update!

Good to know that things are working fine. :tada:

kwilczynski avatar Apr 09 '24 11:04 kwilczynski

Hello everyone! :wave: Are we still having issues with the operator installation on CRI-O 1.25 and 1.26?

I think this problem has been resolved, so perhaps we could close this issue? Thoughts?

kwilczynski avatar May 02 '24 02:05 kwilczynski