OpenShift with A100 GPU: driver pod not getting ready and nvidia-smi outputs "No devices were found"
1. Quick Debug Information
- OS/Version: Red Hat Enterprise Linux CoreOS release 4.12
- Kernel Version: 4.18.0-372.82.1.el8_6.x86_64
- Container Runtime Type/Version: CRI-O
- K8s Flavor/Version: OpenShift Server Version 4.12.45, Kubernetes Version v1.25.14+a52e8df
- GPU Operator Version: v23.9.1, v23.6.1
- GPU node: A100
2. Issue or feature description
The GPU operator's driver pod fails to become ready. The ClusterPolicy was created with defaults and "use_ocp_driver_toolkit" enabled; the driver pod never becomes ready because its startup probe fails with "No devices were found". The GPU node has an A100:
[root@worker6 driver]# lspci | grep -i nvidia
13:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
[root@worker6 driver]#
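So the device is visible on the PCI bus. A useful follow-up check on the same node is to see which kernel module has actually claimed the device, for example:

# Show the kernel driver currently bound to the A100 (slot 13:00.0 from the lspci output above)
lspci -nnk -s 13:00.0
# Confirm whether the nvidia module (or nouveau / vfio-pci) is loaded
lsmod | grep -E 'nvidia|nouveau|vfio'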
Since the driver pod is not fully ready (the nvidia-driver-ctr container never becomes ready), the other operand pods are stuck in the Init state:
[core@master0 tmp]$ kubectl get po -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-mdkg4 0/1 Init:0/1 0 52m
gpu-operator-595587c664-96gsn 1/1 Running 0 114m
nvidia-container-toolkit-daemonset-vpmlh 0/1 Init:0/1 0 52m
nvidia-dcgm-2gxdj 0/1 Init:0/1 0 52m
nvidia-dcgm-exporter-q96qf 0/1 Init:0/2 0 52m
nvidia-device-plugin-daemonset-stb74 0/1 Init:0/1 0 52m
nvidia-driver-daemonset-412.86.202311271639-0-qkdsz 1/2 Running 2 (11m ago) 53m
nvidia-node-status-exporter-fmnkb 1/1 Running 0 53m
nvidia-operator-validator-h8kwp 0/1 Init:0/4 0 52m
[core@master0 tmp]$
[core@master0 tmp]$ kubectl exec -ti nvidia-driver-daemonset-412.86.202311271639-0-qkdsz -n nvidia-gpu-operator -- nvidia-smi
No devices were found
command terminated with exit code 6
[core@master0 tmp]$
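To see why nvidia-driver-ctr keeps restarting, the driver container log and the pod events can be pulled with something along these lines (pod name taken from the listing above):

# Driver container log (add --previous for the log of the last restarted container)
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202311271639-0-qkdsz -c nvidia-driver-ctr
# Pod events, including the failing startup probe
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202311271639-0-qkdsz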
We also tried setting use_ocp_driver_toolkit to "false" and applying the entitlement, but even then the driver pod failed to download the packages kernel-headers-4.18.0-372.82.1.el8_6.x86_64 and kernel-devel-4.18.0-372.82.1.el8_6.x86_64 and went into CrashLoopBackOff with the following error:
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.6 install kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64
Error: Unable to find a match: kernel-headers-4.18.0-372.82.1.el8_6.x86_64 kernel-devel-4.18.0-372.82.1.el8_6.x86_64
++ rm -rf /tmp/tmp.ffQrW5HEcg
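The dnf error means no kernel-headers/kernel-devel build matching the running kernel was found in the enabled repositories. Assuming the entitlement repositories are mounted into the driver container, this can be cross-checked roughly like this:

# Versions available from the enabled RHEL 8.6 repos; the running kernel's exact version must appear here
dnf -q --releasever=8.6 list --showduplicates kernel-devel kernel-headers
# Running kernel version that the packages need to match
uname -r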
3. Steps to reproduce the issue
Use the following manifest YAML to create the ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: notReady
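For reference, applying the manifest from the CLI and checking the reported state looks roughly like this (the file name gpu-cluster-policy.yaml is illustrative; the trailing status: block is reported back by the cluster and is not part of the applied spec):

# Create/update the ClusterPolicy
oc apply -f gpu-cluster-policy.yaml
# Overall state reported by the operator; it stays notReady while the driver pod is failing
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'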
Let us know if more information is required.
"No devices were found" typically indicates the driver failed to initialize. Can you collect system logs by running dmesg | grep -i nvrm on the host?
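If you don't have direct SSH access to the node, one way to collect that on an OpenShift worker is via a debug pod, for example (node name taken from the lspci output above):

# Run dmesg on the host from a debug pod and filter for NVIDIA kernel module (NVRM) messages
oc debug node/worker6 -- chroot /host sh -c 'dmesg | grep -i nvrm'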