NVIDIA GPU Operator fails to detect vGPU devices on an OpenShift cluster with A100 GPU nodes
Problem Statement
We're trying to set up vGPU for VM workloads on the OpenShift cluster, but at present the nvidia-sandbox-validator pods are stuck in the Init state.
Infrastructure details
- Bare-metal OpenShift cluster 4.16.4
- 2 GPU nodes with NVIDIA A100 GPU cards
- Node Feature Discovery Operator 4.16.0 installed
- OpenShift Virtualization Operator 4.16.6 installed
- NVIDIA GPU Operator 24.9.2 installed
Use Case
At present we're trying to set up vGPU for VM workloads on the OpenShift cluster, i.e. for OpenShift Virtualization VMs.
Steps to reproduce
- Bare-metal OpenShift cluster with A100 GPU nodes
- Install the Node Feature Discovery Operator and create its CR
- Install the NVIDIA GPU Operator and create the ClusterPolicy
- Build the NVIDIA vGPU driver image using the software downloaded from the NVIDIA website
- After a while the nvidia-sandbox-validator pods are stuck in the Init state on the `vgpu-devices-validation` init container, with the message `No vGPU devices found, retrying after 5 seconds` (the commands we use to check this are sketched below)
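A rough sketch of how we inspect the stuck pods and the init container's output, assuming the default nvidia-gpu-operator namespace (the pod name below is a placeholder):

```shell
# List the operator pods; the nvidia-sandbox-validator pods stay in Init
oc get pods -n nvidia-gpu-operator

# Logs of the vgpu-devices-validation init container of one stuck pod
# (<nvidia-sandbox-validator-pod> is a placeholder for the real pod name)
oc logs -n nvidia-gpu-operator <nvidia-sandbox-validator-pod> -c vgpu-devices-validation
```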
Applied cluster policy
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2025-03-23T16:52:10Z'
  generation: 4
  managedFields:
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          .: {}
          'f:conditions': {}
          'f:namespace': {}
          'f:state': {}
      manager: gpu-operator
      operation: Update
      subresource: status
      time: '2025-03-23T16:52:29Z'
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          'f:vgpuManager':
            .: {}
            'f:enabled': {}
            'f:image': {}
            'f:imagePullSecrets': {}
            'f:repository': {}
            'f:version': {}
          'f:vfioManager':
            .: {}
            'f:enabled': {}
          'f:daemonsets':
            .: {}
            'f:updateStrategy': {}
          'f:sandboxWorkloads':
            .: {}
            'f:defaultWorkload': {}
            'f:enabled': {}
          'f:nodeStatusExporter':
            .: {}
            'f:enabled': {}
          'f:toolkit':
            .: {}
            'f:enabled': {}
            'f:installDir': {}
          'f:vgpuDeviceManager':
            .: {}
            'f:config':
              .: {}
              'f:default': {}
              'f:name': {}
            'f:enabled': {}
          .: {}
          'f:gfd': {}
          'f:migManager':
            .: {}
            'f:enabled': {}
          'f:mig':
            .: {}
            'f:strategy': {}
          'f:operator':
            .: {}
            'f:defaultRuntime': {}
            'f:initContainer': {}
            'f:runtimeClass': {}
            'f:use_ocp_driver_toolkit': {}
          'f:dcgm':
            .: {}
            'f:enabled': {}
          'f:dcgmExporter': {}
          'f:sandboxDevicePlugin':
            .: {}
            'f:enabled': {}
          'f:driver':
            .: {}
            'f:enabled': {}
          'f:devicePlugin': {}
          'f:validator':
            .: {}
            'f:plugin':
              .: {}
              'f:env': {}
      manager: Mozilla
      operation: Update
      time: '2025-03-25T12:11:42Z'
  name: gpu-cluster-policy
  resourceVersion: '87806614'
  uid: c912e2cb-e05d-4675-9ea7-ebc8b217129a
spec:
  vgpuDeviceManager:
    config:
      default: default
      name: vgpu-devices-config
    enabled: false
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd: {}
  dcgmExporter: {}
  driver:
    enabled: false
  devicePlugin: {}
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: false
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  vgpuManager:
    enabled: true
    image: vgpu-manager
    imagePullSecrets:
      - demo-secret
    repository: <image-reference-built-from-nvidia-software-removed-here-for-now>
    version: 570.124.03
  vfioManager:
    enabled: false
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  conditions:
    - lastTransitionTime: '2025-03-23T16:52:25Z'
      message: ''
      reason: Error
      status: 'False'
      type: Ready
    - lastTransitionTime: '2025-03-23T16:52:25Z'
      message: 'ClusterPolicy is not ready, states not ready: [state-sandbox-validation]'
      reason: OperandNotReady
      status: 'True'
      type: Error
  namespace: nvidia-gpu-operator
  state: notReady
```
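For quick reference, the vGPU-related parts of the spec above are vgpuManager (enabled: true), vgpuDeviceManager (enabled: false), sandboxDevicePlugin (enabled: false), vfioManager (enabled: false) and sandboxWorkloads (enabled: true, defaultWorkload: vm-vgpu). One convenient, non-essential way to pull just those fields from the live object:

```shell
# Print only the vGPU-related fields of the applied ClusterPolicy
oc get clusterpolicy gpu-cluster-policy \
  -o jsonpath='{.spec.vgpuManager}{"\n"}{.spec.vgpuDeviceManager}{"\n"}{.spec.sandboxDevicePlugin}{"\n"}{.spec.vfioManager}{"\n"}{.spec.sandboxWorkloads}{"\n"}'
```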
Observations
We have the vGPU devices listed under /sys/bus/pci/devices, but somehow they do not seem to be picked up by the validator.
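This is roughly what we check on the GPU node itself (the node name is a placeholder; the mdev paths only exist once the vGPU Manager has registered mediated-device types, so their absence may simply restate the problem):

```shell
# Debug shell on a GPU node (<gpu-node> is a placeholder)
oc debug node/<gpu-node> -- chroot /host sh -c '
  # NVIDIA PCI functions visible to the host (vendor ID 0x10de)
  grep -l 0x10de /sys/bus/pci/devices/*/vendor
  # mdev-capable parent devices and their supported vGPU types, if any were registered
  ls /sys/class/mdev_bus/ 2>/dev/null
  ls -d /sys/bus/pci/devices/*/mdev_supported_types 2>/dev/null
'
```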
Logs of the nvidia-sandbox-validator pod, init container vgpu-devices-validation:
time="2025-03-30T08:05:02Z" level=info msg="version: a7551902, commit: a755190"
time="2025-03-30T08:05:02Z" level=info msg="GPU workload configuration: vm-vgpu"
time="2025-03-30T08:05:02Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:07Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:12Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:17Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:22Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:27Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:32Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:37Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
time="2025-03-30T08:05:42Z" level=info msg="No vGPU devices found, retrying after 5 seconds"
Let me know if I'm missing any steps.