gpu-operator
Latest CRI-O (on 1.25/1.26) failing to install gpu-operator
tl;dr at the bottom
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL8.9
- Kernel Version:
4.18.0-513.18.1.el8_9.x86_64
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): cri-o (versions 1.25 and 1.26)
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP (versions 4.12, 4.13)
- GPU Operator Version: 23.9.2 (latest)
2. Issue or feature description
After pulling in the most recent cri-o changes on OCP 4.12/4.13 (https://github.com/cri-o/cri-o/compare/1b1a520...8724c4d), i.e.
CRI-O: cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8
cri-o-1.26.5-7.rhaos4.13.git692ef91.el8
the GPU driver installer fails to install elfutils:
Installing elfutils...
+ echo 'Installing elfutils...'
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
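As a quick sanity check for anyone hitting the same FATAL message, the following hedged sketch lists the subscription mount point inside the driver container. The in-container path is an assumption inferred from the etc-pki-entitlement entry in the CRI-O log further below; the pod name and namespace are the ones from this report:
# Check whether the RHEL entitlement certs were mounted into the driver container.
# /run/secrets/etc-pki-entitlement is the assumed in-container subscription path.
kubectl exec -n gpu-operator-resources nvidia-driver-daemonset-c9zfb -c nvidia-driver-ctr -- \
  ls -l /run/secrets/etc-pki-entitlement || echo "no entitlement certificates mounted"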
3. Steps to reproduce the issue
Use container runtime cri-o at version(s):
cri-o-1.25.5-10.rhaos4.12.git8724c4d.el8
cri-o-1.26.5-7.rhaos4.13.git692ef91.el8
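To confirm which CRI-O build a node is actually running before and after the bump, something like the following should work (the node name is a placeholder):
# The CONTAINER-RUNTIME column shows e.g. cri-o://1.25.5-10...
oc get nodes -o wide
# Or query the RPM directly on the node from a debug pod.
oc debug node/<node-name> -- chroot /host rpm -q cri-o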
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
k get pods -n gpu-operator-resources -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-9qq4g 0/1 Init:0/1 0 3h28m 172.17.162.202 10.180.8.40 <none> <none>
gpu-operator-8b54f655-45f6k 1/1 Running 0 3h46m 172.17.162.253 10.180.8.40 <none> <none>
nfd-controller-manager-5988c689d-ddg4q 2/2 Running 0 4h 172.17.162.248 10.180.8.40 <none> <none>
nfd-master-966d4c54c-l7mv4 1/1 Running 0 3h47m 172.17.162.251 10.180.8.40 <none> <none>
nfd-worker-48tnv 1/1 Running 0 3h47m 10.180.8.39 10.180.8.39 <none> <none>
nfd-worker-8n856 1/1 Running 1 (3h47m ago) 3h47m 10.180.8.40 10.180.8.40 <none> <none>
nfd-worker-jqssq 1/1 Running 0 3h47m 10.180.8.38 10.180.8.38 <none> <none>
nvidia-container-toolkit-daemonset-n7k8z 0/1 Init:0/1 0 3h28m 172.17.162.230 10.180.8.40 <none> <none>
nvidia-dcgm-exporter-hmg7m 0/1 Init:0/1 0 3h28m 10.180.8.40 10.180.8.40 <none> <none>
nvidia-dcgm-fnz2j 0/1 Init:0/1 0 3h28m 10.180.8.40 10.180.8.40 <none> <none>
nvidia-device-plugin-daemonset-zcbbl 0/1 Init:0/1 0 3h28m 172.17.162.254 10.180.8.40 <none> <none>
nvidia-driver-daemonset-c9zfb 0/1 CrashLoopBackOff 43 (5m ago) 3h29m 172.17.162.220 10.180.8.40 <none> <none>
nvidia-node-status-exporter-b6gxl 1/1 Running 0 3h43m 172.17.162.198 10.180.8.40 <none> <none>
nvidia-operator-validator-zr6sb 0/1 Init:0/4 0 3h28m 172.17.162.252 10.180.8.40 <none> <none>
- [x] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
k get ds -n gpu-operator-resources
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 3h43m
nfd-worker 3 3 3 3 3 <none> 3h48m
nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 3h43m
nvidia-dcgm 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm=true 3h43m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 3h43m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 3h43m
nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 3h43m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 3h43m
nvidia-node-status-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.node-status-exporter=true 3h43m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 3h43m
- [x] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
k describe pod -n gpu-operator-resources nvidia-driver-daemonset-c9zfb
Name: nvidia-driver-daemonset-c9zfb
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: 10.180.8.40/10.180.8.40
Start Time: Mon, 11 Mar 2024 16:59:01 -0500
Labels: app=nvidia-driver-daemonset
app.kubernetes.io/component=nvidia-driver
controller-revision-hash=dc74cc498
nvidia.com/precompiled=false
pod-template-generation=3
Annotations: cni.projectcalico.org/containerID: bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6
cni.projectcalico.org/podIP: 172.17.162.220/32
cni.projectcalico.org/podIPs: 172.17.162.220/32
k8s.v1.cni.cncf.io/network-status:
[{
"name": "k8s-pod-network",
"ips": [
"172.17.162.220"
],
"default": true,
"dns": {}
}]
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
openshift.io/scc: nvidia-driver
Status: Running
IP: 172.17.162.220
IPs:
IP: 172.17.162.220
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: cri-o://51588c2c91637fbdaaa68b22e2b9100199a5b8e3afa0b98ea471f1acdc64a716
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:27c44f4720a4abf780217bd5e7903e4a008ebdbcf71238c4f106a0c22654776c
Port: <none>
Host Port: <none>
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 11 Mar 2024 16:59:03 -0500
Finished: Mon, 11 Mar 2024 16:59:37 -0500
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: true
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: gpu-operator-resources (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9xwl (ro)
Containers:
nvidia-driver-ctr:
Container ID: cri-o://ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4
Image: nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3
Image ID: nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3
Port: <none>
Host Port: <none>
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 11 Mar 2024 20:28:37 -0500
Finished: Mon, 11 Mar 2024 20:28:49 -0500
Ready: False
Restart Count: 44
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment: <none>
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/lib/firmware from nv-firmware (rw)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/sys/devices/system/memory/auto_online_blocks from sysfs-memory-online (rw)
/sys/module/firmware_class/parameters/path from firmware-search-path (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9xwl (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
firmware-search-path:
Type: HostPath (bare host directory volume)
Path: /sys/module/firmware_class/parameters/path
HostPathType:
sysfs-memory-online:
Type: HostPath (bare host directory volume)
Path: /sys/devices/system/memory/auto_online_blocks
HostPathType:
nv-firmware:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver/lib/firmware
HostPathType: DirectoryOrCreate
kube-api-access-m9xwl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 80m (x29 over 3h29m) kubelet Container image "nvcr.io/nvidia/driver@sha256:6f51a22e01fd08ab0fde543e0c4dc6d7f7abb0f20d38205a98f3f1716cb3d7d3" already present on machine
Warning BackOff 29s (x947 over 3h29m) kubelet Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-c9zfb_gpu-operator-resources(8a0d8d4f-9f88-4c42-930d-508ab7653a98)
- [x] If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
k logs -n gpu-operator-resources nvidia-driver-daemonset-c9zfb -c nvidia-driver-ctr -p
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=550.54.14
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
DRIVER_ARCH is x86_64
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ RHEL_VERSION=
+ RHEL_MAJOR_VERSION=8
+ OPEN_KERNEL_MODULES_ENABLED=false
+ [[ false == \t\r\u\e ]]
+ KERNEL_TYPE=kernel
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
++ GDRCOPY_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-513.18.1.el8_9.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ _get_rhel_version_from_kernel
+ local rhel_version_underscore rhel_version_arr
++ echo 4.18.0-513.18.1.el8_9.x86_64
++ sed 's/.*el\([0-9]\+_[0-9]\+\).*/\1/g'
+ rhel_version_underscore=8_9
+ [[ ! 8_9 =~ ^[0-9]+_[0-9]+$ ]]
+ IFS=_
+ read -r -a rhel_version_arr
+ [[ 2 -ne 2 ]]
+ RHEL_VERSION=8.9
+ echo 'RHEL VERSION successfully resolved from kernel: 8.9'
RHEL VERSION successfully resolved from kernel: 8.9
+ return 0
+ [[ -z '' ]]
+ DNF_RELEASEVER=8.9
+ return 0
+ init
+ _prepare_exclusive
+ _prepare
+ '[' passthrough = vgpu ']'
+ sh NVIDIA-Linux-x86_64-550.54.14.run -x
Creating directory NVIDIA-Linux-x86_64-550.54.14
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.14........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
+ cd NVIDIA-Linux-x86_64-550.54.14
+ sh /tmp/install.sh nvinstall
DRIVER_ARCH is x86_64
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
+ mkdir -p /usr/src/nvidia-550.54.14
+ mv LICENSE mkprecompiled kernel /usr/src/nvidia-550.54.14
+ sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest
========== NVIDIA Software Installer ==========
+ echo -e '\n========== NVIDIA Software Installer ==========\n'
+ echo -e 'Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64\n'
Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 4.18.0-513.18.1.el8_9.x86_64
+ exec
+ flock -n 3
+ echo 332725
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _build
+ _kernel_requires_package
+ local proc_mount_arg=
Checking NVIDIA driver packages...
+ echo 'Checking NVIDIA driver packages...'
+ [[ ! -d /usr/src/nvidia-550.54.14/kernel ]]
+ cd /usr/src/nvidia-550.54.14/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-513.18.1.el8_9.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
Updating the package cache...
+ echo 'Updating the package cache...'
+ yum -q makecache
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.2PbAo42Ahy
+ trap 'rm -rf /tmp/tmp.2PbAo42Ahy' EXIT
+ cd /tmp/tmp.2PbAo42Ahy
+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.2PbAo42Ahy
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0
- [x] Output from running
nvidia-smi
from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [x] container runtime (CRI-O) logs
journalctl -u crio > crio.log
Mar 11 20:28:36 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:36.897237180-05:00" level=warning msg="Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem on host \"/var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets\": write subscription data: write file: open /var/data/crioruntimestorage/overlay-containers/ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4/userdata/run/secrets/etc-pki-entitlement/6292044582955687386-key.pem: no such file or directory"
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.017844642-05:00" level=info msg="Created container ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4: gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr" id=845fbe19-ee47-4a2a-813f-d0bd23f6ba6c name=/runtime.v1.RuntimeService/CreateContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.018493833-05:00" level=info msg="Starting container: ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4" id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer
Mar 11 20:28:37 test-cnnmnql20b6ec423fsv0-brucetestro-v100-00000380 crio[9148]: time="2024-03-11 20:28:37.025441319-05:00" level=info msg="Started container" PID=332725 containerID=ce75d5618b76c3fd6febf508a4d142f66ca6c46040f2c7f8a74bc0cbc88ceeb4 description=gpu-operator-resources/nvidia-driver-daemonset-c9zfb/nvidia-driver-ctr id=cdff208d-b23c-4d0f-baae-db8e4dee04c1 name=/runtime.v1.RuntimeService/StartContainer sandboxID=bf904eb1f2c645c2c74a61a73f0a1d70d4a530fcf971142816c6c05163b332d6
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
tl;dr
It looks like the culprit is specifically the change from the os.WriteFile function to the umask.WriteFileIgnoreUmask function on this line: https://github.com/cri-o/cri-o/pull/7774/files#diff-23e01fcec1708a4fa51b3f495b7c7f075070b0a9c5a9195f349efee6d9444d4dR271
CRI-O fails to mount the subscription into the container, as seen in these logs (more above):
Failed to mount subscriptions, skipping entry in /usr/share/containers/mounts.conf: saving data to container filesystem
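To confirm on an affected node that it is hitting the same failure, the CRI-O journal can be checked from a debug pod; a rough sketch (the node name is a placeholder):
# Look for the subscription mount warning in the CRI-O unit journal on the host.
oc debug node/<node-name> -- chroot /host journalctl -u crio | grep -i "Failed to mount subscriptions"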
Emailed must gather to [email protected]
I think this problem is RHEL or OpenShift specific. I have K8s 1.25.5 running on CRI-O 1.25.1 (runc) on Rocky Linux 8.7, and the GPU operator runs without issues.
@Zveroloff have you tried upgrading cri-o to cri-o-1.25.5-10? This is something we just recently started seeing after this last version bump.
@fabiendupont can you help address this issue in CRI-O, which is causing subscription mounts to fail?
Hello everyone!
The work on CRI-O's side (via https://github.com/cri-o/cri-o/issues/7880) has already been completed.
There should be no more issues with CRI-O 1.25 and 1.26 (newer releases of CRI-O were not affected) that would prevent this operator from running.
Thanks for the update @kwilczynski
@francisguillier, your issue appears to be unrelated to the problem we have here.
Hopefully, you were able to resolve it.
@kwilczynski - I saw that the fix was backported to 4.12.54 RHSA
I've updated a cluster to this newer version and still see the issue present:
Worker Info:
cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com Ready,SchedulingDisabled canary,worker 27h v1.25.16+9946c63 169.60.156.4 <none> Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa) 4.18.0-372.98.1.el8_6.x86_64 cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8
Nvidia Pods on Worker:
╰$ oc get pods -o wide -A | grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com | grep nvidia
nvidia-gpu-operator gpu-feature-discovery-rcdcf 0/1 Init:0/1 0 162m 10.130.2.16 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-container-toolkit-daemonset-lxpm7 0/1 Init:0/1 0 162m 10.130.2.17 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-dcgm-exporter-n227d 0/1 Init:0/1 0 162m 169.60.156.4 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-dcgm-kr6v8 0/1 Init:0/1 0 162m 169.60.156.4 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-device-plugin-daemonset-stgwm 0/1 Init:0/1 0 162m 10.130.2.19 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-driver-daemonset-kz5th 0/1 CrashLoopBackOff 294 (53s ago) 26h 10.130.2.2 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-node-status-exporter-rcmdj 1/1 Running 3 28h 10.130.2.5 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
nvidia-gpu-operator nvidia-operator-validator-hpkvt 0/1 Init:0/4 0 162m 10.130.2.18 cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com <none> <none>
The output of the failing pod shows the same error as before:
+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.
+ echo 'FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed.'
+ exit 1
++ rm -rf /tmp/tmp.AIojKsyUdp
@jmkanz a couple of things:
- a better forum may be (if possible) an OpenShift Jira ticket, as this forum is really more for upstream cri-o, and these versions are out of upstream support
- can you help me put together a more minimal reproducer? I attempted to install the nvidia operator, and created a clusterpolicy and nvidia driver instance, but I wonder if I did the right steps as I'm getting different failures (and I doubt the cluster I installed has GPUs to provision)
- I also tried to use a ubi8 image and I was able to install packages (elfutils was installed in the ubi8 base, but I could install other packages, and I could also install it in ubi8-minimal with microdnf). I do get warnings about not having entitlement certs (Found 0 entitlement certificates), but that's a different one than you are hitting
@jmkanz can you post the status of all pods in the cluster please (especially coredns).
cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com Ready,SchedulingDisabled canary,worker 27h v1.25.16+9946c63 169.60.156.4 <none> Red Hat Enterprise Linux CoreOS 412.86.202403280709-0 (Ootpa) 4.18.0-372.98.1.el8_6.x86_64 cri-o://1.25.5-13.1.rhaos4.12.git76343da.el8
The GPU Operator does seem to cordon the node in this case, so I'm wondering if any networking pods are being evicted, which would cause the driver install to fail.
@shivamerla - I've manually cordoned this node since I updated it to the latest version of OpenShift 4.12.
The cordon should not impact the NVIDIA pods, as they run as daemonsets. I've cordoned other nodes in the cluster as well (that are on an older version of CoreOS) and they run fine with or without the cordon.
Additionally, other pods are running fine on the node. Please see below (edited to sanitize IPs):
oc get pods -A -o wide |grep cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com
ibm-object-s3fs ibmcloud-object-storage-driver-l925j 1/1 Running 0 126m
ibm-observe logdna-agent-8xssh 1/1 Running 3 2d3h
ibm-observe sysdig-agent-n6v9s 1/1 Running 3 2d3h
jeg kernel-image-puller-5bfpp 1/1 Running 3 2d3h
kube-system istio-cni-node-cl5sj 1/1 Running 3 2d3h
nvidia-gpu-operator gpu-feature-discovery-rcdcf 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-container-toolkit-daemonset-lxpm7 0/1 Init:0/1 0
nvidia-gpu-operator nvidia-dcgm-exporter-n227d 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-dcgm-kr6v8 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-device-plugin-daemonset-stgwm 0/1 Init:0/1 0 25h
nvidia-gpu-operator nvidia-driver-daemonset-kz5th 0/1 CrashLoopBackOff 558 (14s ago) 2d1h
nvidia-gpu-operator nvidia-node-status-exporter-rcmdj 1/1 Running 3 2d3h
nvidia-gpu-operator nvidia-operator-validator-hpkvt 0/1 Init:0/4 0 25h
openshift-cluster-node-tuning-operator tuned-vnkmb 1/1 Running 1 47h
openshift-dns dns-default-4vdm9 2/2 Running 2 47h
openshift-dns node-resolver-l5vtp 1/1 Running 1 47h
openshift-image-registry node-ca-zmnhr 1/1 Running 1 47h
openshift-ingress-canary ingress-canary-xrdwz 1/1 Running 1 47h
openshift-machine-config-operator machine-config-daemon-bpj6x 2/2 Running 2
openshift-monitoring node-exporter-djqxn 2/2 Running 2 47h
openshift-multus multus-additional-cni-plugins-zhw2l 1/1 Running 1 47h
openshift-multus multus-ddwks 1/1 Running 1 47h
openshift-multus network-metrics-daemon-cbj56 2/2 Running 2 47h
openshift-network-diagnostics network-check-target-c6snb 1/1 Running 1 47h
openshift-nfd nfd-worker-pflkc 1/1 Running 3 2d3h
openshift-sdn sdn-j7kng 2/2 Running 2 47h
openshift-storage csi-cephfsplugin-fndsj 2/2 Running 6 2d3h
openshift-storage csi-rbdplugin-5c6ml 3/3 Running 9 2d3h
tekton-pipelines pwa-2r24x 1/1 Running 3 2d3h
DNS Pods for the cluster as well:
╰$ oc get pods -A |grep dns
openshift-dns-operator dns-operator-7f86f6f997-766l4 2/2 Running 0 47h
openshift-dns dns-default-4n6hq 2/2 Running 0 47h
openshift-dns dns-default-4vdm9 2/2 Running 2 47h
openshift-dns dns-default-7v9rx 2/2 Running 0 47h
openshift-dns dns-default-9wwps 2/2 Running 0 47h
openshift-dns dns-default-bcv7s 2/2 Running 2 47h
openshift-dns dns-default-bzsmp 2/2 Running 0 47h
openshift-dns dns-default-csrpd 2/2 Running 0 47h
openshift-dns dns-default-d677l 2/2 Running 0 47h
openshift-dns dns-default-dv45x 2/2 Running 0 47h
openshift-dns dns-default-j7xcv 2/2 Running 4 47h
openshift-dns dns-default-jb62l 2/2 Running 0 47h
openshift-dns dns-default-lkq76 2/2 Running 2 47h
openshift-dns dns-default-lpsfq 2/2 Running 0 47h
openshift-dns dns-default-m6hr9 2/2 Running 0 47h
openshift-dns dns-default-pf825 2/2 Running 0 47h
openshift-dns dns-default-tj4bw 2/2 Running 0 47h
openshift-dns dns-default-zjpsz 2/2 Running 2 47h
openshift-dns dns-default-zl52j 2/2 Running 0 47h
openshift-dns dns-default-zsgx8 2/2 Running 0 47h
openshift-dns node-resolver-2vsc8 1/1 Running 1 47h
openshift-dns node-resolver-48nkh 1/1 Running 0 47h
openshift-dns node-resolver-59vb8 1/1 Running 0 47h
openshift-dns node-resolver-74btd 1/1 Running 0 47h
openshift-dns node-resolver-c5d4p 1/1 Running 0 47h
openshift-dns node-resolver-c5q44 1/1 Running 0 47h
openshift-dns node-resolver-clck8 1/1 Running 0 47h
openshift-dns node-resolver-fjgnb 1/1 Running 1 47h
openshift-dns node-resolver-g54sd 1/1 Running 0 47h
openshift-dns node-resolver-gd6rk 1/1 Running 0 47h
openshift-dns node-resolver-gs4z2 1/1 Running 2 47h
openshift-dns node-resolver-l5n4z 1/1 Running 0 47h
openshift-dns node-resolver-l5vtp 1/1 Running 1 47h
openshift-dns node-resolver-l9kjc 1/1 Running 0 47h
openshift-dns node-resolver-rls8p 1/1 Running 1 47h
openshift-dns node-resolver-tr6wf 1/1 Running 0 47h
openshift-dns node-resolver-vlnj4 1/1 Running 0 47h
openshift-dns node-resolver-whzp6 1/1 Running 0 47h
openshift-dns node-resolver-wrnfs 1/1 Running 0 47h
openshift-dns node-resolver-xqwxs 1/1 Running 0 47h
openshift-dns node-resolver-zhb5c 1/1 Running 0 47h
openshift-dns node-resolver-zm4m8 1/1 Running 0 47h
openshift-dns node-resolver-zph8p 1/1 Running 0 47h
Hey @haircommander - Thanks for your reply. We can move the conversation over to the other issue in the CRI-O repo if you prefer? This is the NVIDIA one. Additionally, if your cluster doesn't have GPUs, I doubt the install will even begin, since you need the correct labels from the NFD operator for GPU-enabled workers.
I believe @KodieGlosserIBM still has the issue open in Jira with Red Hat.
ah, I thought I was commenting there :upside_down_face:. This is fine too if this feels right.
Still wondering about a more minimal reproducer, potentially without nvidia operator in the picture. Or, if you could help me get access to the environment with this failing, that would work too
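Something along these lines might serve as a minimal reproducer without the NVIDIA operator in the picture: pin a plain UBI8 pod to the affected node (spec.nodeName bypasses the scheduler, so the cordon doesn't matter) and run the same dnf install the driver container runs. This is only a sketch; the node name is taken from the output above:
# Run a plain UBI8 pod directly on the affected node.
oc run entitlement-test --image=registry.access.redhat.com/ubi8/ubi --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"cash-worker-cf-01.c1-ocp.sl.cloud9.ibm.com"}}' \
  --command -- sleep 3600
# On a node with the broken subscription mount this should fail with
# "Unable to find a match: elfutils-libelf-devel.x86_64".
oc exec entitlement-test -- dnf install -y elfutils-libelf-devel.x86_64
oc delete pod entitlement-test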
This seems to be resolved. I noticed this cluster had a ClusterPolicy that was not using the OCP Driver Toolkit, which is what prevents these entitlement issues.
Please see the links below for more information:
OpenShift Driver Toolkit info: https://docs.openshift.com/container-platform/4.12/hardware_enablement/psap-driver-toolkit.html
NVIDIA Docs on Installation with / without Driver Toolkit: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html
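For reference, a rough way to check whether an existing ClusterPolicy is already using the Driver Toolkit rather than entitlements. Field names vary by operator version, so this just greps the spec and the driver pod (the namespace is the one from the output above; see the NVIDIA docs linked above for the authoritative setting):
# Grep the ClusterPolicy spec for the Driver Toolkit setting.
oc get clusterpolicy -o yaml | grep -i toolkit
# When DTK is in use, the driver pod should also list a driver-toolkit container.
oc describe pod -n nvidia-gpu-operator -l app=nvidia-driver-daemonset | grep -i toolkit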
@jmkanz and @KodieGlosserIBM, thank you for the update!
Good to know that things are working fine. :tada:
Hello everyone! :wave: Are we still having issues with the operator installation on CRI-O 1.25 and 1.26?
I think this problem has been resolved, so perhaps we could close this issue? Thoughts?