gpu-operator
No devices were found in OpenShift
1. Quick Debug Information
- OS/Version: Red Hat Enterprise Linux CoreOS release 4.12
- Kernel Version: 4.18.0-372.69.1.el8_6.x86_64
- Container Runtime Type/Version: CRI-O
- OpenShift Version: 4.12.29
- GPU Operator Version: 23.9.1
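
For reference, the information above can be collected with standard oc commands; a rough sketch, where the nvidia-gpu-operator namespace is an assumption based on the default install:

oc version
oc get nodes -o wide                  # OS image, kernel version, and CRI-O version per node
oc get csv -n nvidia-gpu-operator     # installed GPU Operator version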
2. Issue or feature description
The nvidia-driver-daemonset-xx pod reports "Startup probe failed: No devices were found" in its events, but the Tesla V100 GPU is visible on the OS. Below is the lspci output:
03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
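
To narrow down where the startup probe fails, the driver daemonset pod can be inspected directly. A minimal sketch, assuming the default nvidia-gpu-operator namespace and the app=nvidia-driver-daemonset label and nvidia-driver-ctr container name commonly used by the operator:

oc get pods -n nvidia-gpu-operator
oc describe pod -n nvidia-gpu-operator -l app=nvidia-driver-daemonset
oc logs -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -c nvidia-driver-ctr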
3. Steps to reproduce the issue
Deploy the GPU Operator with the ClusterPolicy definition below.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-12-20T13:06:29Z'
  generation: 2
  name: gpu-cluster-policy
  resourceVersion: '275859864'
  uid: 71e06b17-5b47-4ab0-aae9-8034a2e30e42
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    enabled: true
    certConfig:
      name: ''
    repository: nvcr.io/nvidia
    kernelModuleConfig:
      name: ''
    usePrecompiled: false
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    version: 535.104.05
    virtualTopology:
      config: ''
    image: driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
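
After the ClusterPolicy is created, its overall state and the operand pods can be checked roughly as follows (a sketch; the file name is a placeholder and the namespace assumes the default install):

oc apply -f gpu-cluster-policy.yaml
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
oc get pods -n nvidia-gpu-operator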
@garyyang85 "No devices were found" typically indicates that GPU initialization failed. Can you collect the system logs by running dmesg | grep -i nvrm on the host?
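
If there is no direct SSH access to the worker, one way to run that on OpenShift is through a node debug pod; the node name below is a placeholder:

oc debug node/<gpu-node-name> -- chroot /host sh -c 'dmesg | grep -i nvrm'

This starts a debug container on the node and runs dmesg against the host's root filesystem, so the kernel messages from the NVIDIA driver load attempt are visible.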