gpu-operator
error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Red Hat Enterprise Linux CoreOS release 4.11
- Kernel Version: Linux 4.18.0-372.46.1.el8_6.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.11
- GPU Operator Version: 23.6.1 provided by NVIDIA Corporation
2. Issue or feature description
We can't configure vGPUs with the NVIDIA GPU Operator following the docs here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/openshift-virtualization.html
3. Steps to reproduce the issue
- Install the NVIDIA GPU Operator and create a ClusterPolicy with the following vGPU parameters:
sandboxWorkloads.enabled=true
vgpuManager.enabled=true
vgpuManager.repository=<path to private repository>
vgpuManager.image=vgpu-manager
vgpuManager.version=<driver version>
vgpuManager.imagePullSecrets={<name of image pull secret>}
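For reference, the image pull secret referenced by vgpuManager.imagePullSecrets has to exist in the operator namespace before the vGPU Manager daemonset can pull the private image. A minimal sketch (the secret name, registry server, and credentials here are placeholders):
oc create secret docker-registry vgpu-manager-pull-secret \
  --docker-server=default-route-openshift-image-registry.apps.ocp4.poc.site \
  --docker-username=<user> \
  --docker-password=<password-or-token> \
  -n nvidia-gpu-operator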
This is our cluster policy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
4. Debug info
4.1 When we specify the label nvidia.com/vgpu.config=A100-1-5C on each node
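For reference, the label was set per node with something like the following (a sketch; gpu4 stands in for the actual node names):
oc label node gpu4 nvidia.com/vgpu.config=A100-1-5C --overwrite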
oc logs -f nvidia-vgpu-device-manager-69wm6
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0928 14:49:52.314862 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-09-28T14:49:52Z" level=info msg="Updating to vGPU config: A100-1-5CME"
time="2023-09-28T14:49:52Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-28T14:49:52Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-28T14:49:52Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-28T14:49:52Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-28T14:49:52Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-28T14:49:52Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-28T14:49:52Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-28T14:50:23Z" level=debug msg="Parsing config file..."
time="2023-09-28T14:50:23Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-28T14:50:23Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-28T14:50:23Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-28T14:50:23Z" level=error msg="ERROR: unable to apply config 'A100-1-5CME': exit status 1"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
4.2 When we don't specify any GPU labels and let the NVIDIA operator handle the selection
oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
oc logs -f nvidia-vgpu-device-manager-q8khn -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
@ppetko can you check the logs of the vgpu-manager pod to make sure it installed successfully?
Hi @shivamerla,
It looks like it failed.
oc logs -f nvidia-vgpu-device-manager-69wm6
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0928 14:49:52.314862 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-09-28T14:49:52Z" level=info msg="Updating to vGPU config: A100-1-5CME"
time="2023-09-28T14:49:52Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-28T14:49:52Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-28T14:49:52Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-28T14:49:52Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-28T14:49:52Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-28T14:49:52Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-28T14:49:52Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-28T14:50:23Z" level=debug msg="Parsing config file..."
time="2023-09-28T14:50:23Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-28T14:50:23Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-28T14:50:23Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-28T14:50:23Z" level=error msg="ERROR: unable to apply config 'A100-1-5CME': exit status 1"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
^C
@ppetko can you get logs from the vgpu-manager pod, not the vgpu-device-manager?
@cdesiniotis there is no such pod
oc get pods
NAME READY STATUS RESTARTS AGE
gpu-operator-fbb6ffcc8-gzddt 1/1 Running 0 6d23h
nvidia-sandbox-device-plugin-daemonset-s5v5b 1/1 Running 0 4d23h
nvidia-sandbox-validator-9tmn8 1/1 Running 0 4d23h
nvidia-vfio-manager-5j6wq 1/1 Running 0 4d23h
nvidia-vgpu-device-manager-69wm6 1/1 Running 0 4d23h
nvidia-vgpu-device-manager-w82ds 1/1 Running 0 4d23h
This is the cluster policy I'm using
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
Is the vGPU Manager already installed on the host (e.g. does running nvidia-smi on the host return anything)?
Can you also describe your GPU nodes? In particular I am interested in the value of the node label nvidia.com/gpu.deploy.vgpu-manager.
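For example, a sketch of how to run that check from the cluster (gpu4 is a placeholder node name; nvidia-smi will only be present on the host if a driver/vGPU Manager was installed directly on it):
oc debug node/gpu4 -- chroot /host nvidia-smi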
According to the docs, the vGPU Manager should be deployed by the NVIDIA operator. In the ClusterPolicy CR I reference a container image I built for the vGPU Manager.
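For context, the image referenced above was built following the vGPU Manager container section of the linked docs. Roughly (a sketch from memory; the repo path, build arg, and .run file name are assumptions, so follow the docs for the exact procedure):
# Requires the vGPU Manager .run package downloaded from the NVIDIA Licensing Portal.
git clone https://gitlab.com/nvidia/container-images/driver
cd driver/vgpu-manager/rhel8
cp ~/NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm.run .
podman build --build-arg DRIVER_VERSION=535.104.06 \
  -t default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing/vgpu-manager:535.104.06-rhcos4.11 .
podman push default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing/vgpu-manager:535.104.06-rhcos4.11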
oc describe node gpu4 | grep vgpu-manager
nvidia.com/gpu.deploy.vgpu-manager=true
These are all of the nvidia labels
oc describe node gpu4 | grep nvidia.com
nvidia.com/gpu.deploy.cc-manager=true
nvidia.com/gpu.deploy.nvsm=
nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change
nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change
nvidia.com/gpu.deploy.vgpu-device-manager=true
nvidia.com/gpu.deploy.vgpu-manager=true
nvidia.com/gpu.present=true
nvidia.com/gpu.workload.config=vm-vgpu
nvidia.com/mig.config=all-disabled
nvidia.com/mig.config.state=success
nvidia.com/vgpu.config=A100-2-10C
**nvidia.com/vgpu.config.state=failed**
nvidia.com/A100: 0
nvidia.com/gpu: 0
nvidia.com/A100: 0
nvidia.com/gpu: 0
nvidia.com/A100 1 1
nvidia.com/gpu 0 0
Can you run oc get ds -n nvidia-gpu-operator and describe the vgpu-manager daemonset?
oc get ds -n nvidia-gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 90s
nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true 90s
nvidia-dcgm 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm=true 90s
nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 90s
nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 90s
nvidia-driver-daemonset-411.86.202303060052-0 0 0 0 0 0 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202303060052-0,nvidia.com/gpu.deploy.driver=true 90s
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 90s
nvidia-node-status-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.node-status-exporter=true 90s
nvidia-operator-validator 0 0 0 0 0 nvidia.com/gpu.deploy.operator-validator=true 90s
nvidia-sandbox-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-device-plugin=true 90s
nvidia-sandbox-validator 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-validator=true 90s
nvidia-vfio-manager 1 1 1 1 1 nvidia.com/gpu.deploy.vfio-manager=true 90s
nvidia-vgpu-device-manager 2 2 2 2 2 nvidia.com/gpu.deploy.vgpu-device-manager=true 90s
It looks like I don't have the daemonset for the vgpu-manager, which explains why I don't see any pods. I have specified the label nvidia.com/vgpu.config=A100-2-10C, which I'm not sure is the correct one. If I leave it blank, I'm getting the following:
oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
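For reference, clearing the label uses the trailing-dash form (a sketch; gpu4 stands in for the actual node name):
oc label node gpu4 nvidia.com/vgpu.config-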
I have opened this case, but it hasn't gotten much traction: https://forums.developer.nvidia.com/t/rror-getting-vgpu-config-error-getting-all-vgpu-devices-unable-to-read-mdev-devices-directory-open-sys-bus-mdev-devices-no-such-file-or-directory/267696
This is the output of all resources in the namespace
oc get all
NAME READY STATUS RESTARTS AGE
pod/gpu-operator-fbb6ffcc8-gzddt 1/1 Running 0 7d2h
pod/nvidia-sandbox-device-plugin-daemonset-62rbg 1/1 Running 0 6m29s
pod/nvidia-sandbox-validator-s9zsr 1/1 Running 0 6m29s
pod/nvidia-vfio-manager-wjx99 1/1 Running 0 7m5s
pod/nvidia-vgpu-device-manager-g2xsd 1/1 Running 0 7m5s
pod/nvidia-vgpu-device-manager-tzpcf 1/1 Running 0 7m5s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/gpu-operator ClusterIP 172.30.214.74 <none> 8080/TCP 7m5s
service/nvidia-dcgm-exporter ClusterIP 172.30.37.127 <none> 9400/TCP 7m5s
service/nvidia-node-status-exporter ClusterIP 172.30.62.146 <none> 8000/TCP 7m5s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 7m5s
daemonset.apps/nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true 7m5s
daemonset.apps/nvidia-dcgm 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm=true 7m5s
daemonset.apps/nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 7m5s
daemonset.apps/nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 7m5s
daemonset.apps/nvidia-driver-daemonset-411.86.202303060052-0 0 0 0 0 0 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202303060052-0,nvidia.com/gpu.deploy.driver=true 7m5s
daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 7m5s
daemonset.apps/nvidia-node-status-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.node-status-exporter=true 7m5s
daemonset.apps/nvidia-operator-validator 0 0 0 0 0 nvidia.com/gpu.deploy.operator-validator=true 7m5s
daemonset.apps/nvidia-sandbox-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-device-plugin=true 7m5s
daemonset.apps/nvidia-sandbox-validator 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-validator=true 7m5s
daemonset.apps/nvidia-vfio-manager 1 1 1 1 1 nvidia.com/gpu.deploy.vfio-manager=true 7m5s
daemonset.apps/nvidia-vgpu-device-manager 2 2 2 2 2 nvidia.com/gpu.deploy.vgpu-device-manager=true 7m5s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/gpu-operator 1/1 1 1 7d2h
NAME DESIRED CURRENT READY AGE
replicaset.apps/gpu-operator-fbb6ffcc8 1 1 1 7d2h
This doesn't seem right. If the node is labelled nvidia.com/gpu.workload.config=vm-vgpu, then we deploy both "vgpu-manager" and "vgpu-device-manager". Here we see vfio-manager getting deployed, which happens only when the workload config is vm-passthrough. If you can share the operator logs, we can check why the right operands are not getting deployed.
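For illustration, the per-node workload type is driven by that label, so switching a node between modes looks roughly like this (a sketch; gpu4 is a placeholder node name):
# vGPU mode: should result in vgpu-manager and vgpu-device-manager on the node
oc label node gpu4 nvidia.com/gpu.workload.config=vm-vgpu --overwrite
# passthrough mode: should result in vfio-manager on the node
oc label node gpu4 nvidia.com/gpu.workload.config=vm-passthrough --overwrite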
Ah, the section below is wrong:
vgpuManager:
  driverManager:
    image: vgpu-manager
    repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
    version: 535.104.06-rhcos4.11
  enabled: true
This should be:
vgpuManager:
  enabled: true
  repository: "default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing"
  image: vgpu-manager
  version: "535.104.06-rhcos4.11"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
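One hedged way to apply that correction in place rather than recreating the CR (assumes the ClusterPolicy is still named gpu-cluster-policy; the stray image/repository/version keys left under vgpuManager.driverManager may also need to be removed, e.g. with oc edit clusterpolicy gpu-cluster-policy):
oc patch clusterpolicy gpu-cluster-policy --type merge -p \
  '{"spec":{"vgpuManager":{"enabled":true,"repository":"default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing","image":"vgpu-manager","version":"535.104.06-rhcos4.11"}}}'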
Hm, interesting - this YAML was generated by the ClusterPolicy install using the UI. Look at the logs below... Let me redeploy with the correct YAML file.
{"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to apply transformation","Daemonset":"nvidia-vgpu-manager-daemonset","resource":"nvidia-vgpu-manager-daemonset","error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"info","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Could not pre-process","DaemonSet":"nvidia-vgpu-manager-daemonset","Namespace":"nvidia-gpu-operator","Error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"62d09b2d-b745-4df4-bf74-dda2fd3c7cf2","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 411.86.202303060052-0: failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to apply transformation","Daemonset":"nvidia-vgpu-manager-daemonset","resource":"nvidia-vgpu-manager-daemonset","error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"info","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Could not pre-process","DaemonSet":"nvidia-vgpu-manager-daemonset","Namespace":"nvidia-gpu-operator","Error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"62d09b2d-b745-4df4-bf74-dda2fd3c7cf2","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 411.86.202303060052-0: failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
A little heads-up in the docs would be nice that once you deploy the ClusterPolicy, the operator will roll the cluster and restart each node. I see 2 new machine configs are applied and the cluster is trying to update. The problem is that it's stuck on a node that doesn't have a GPU. I have already loaded the kernel parameters for the GPUs using a machine config only for the nodes that contain a GPU.
What exactly are the machine configs trying to configure? Are there any docs on this process?
The kernel modules are already loaded:
oc debug node/gpu1 -- chroot /host lspci -nnk -d 10de:
Starting pod/gpu1ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
0000:31:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:147e]
Kernel driver in use: nvidia
Kernel modules: nouveau
oc debug node/gpu3 -- chroot /host lspci -nnk -d 10de:
Starting pod/gpu3ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
1b:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1533]
Kernel driver in use: nvidia
Kernel modules: nouveau
1c:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1533]
Kernel driver in use: nvidia
Kernel modules: nouveau
Output of the mcp
oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-bd4920bc82fa2273f8e79e3c851cba39 True False False 3 3 3 0 194d
worker rendered-worker-d242c91395c7a350afeeaab80b133966 False True True 5 3 3 1 194d
On the bright side, I think the deployment is fixed. I checked: the ClusterPolicy UI generates a wrong ClusterPolicy in version 23.6.1 provided by NVIDIA Corporation.
oc get pods
NAME READY STATUS RESTARTS AGE
gpu-operator-fbb6ffcc8-qdd5g 1/1 Running 0 48m
nvidia-sandbox-device-plugin-daemonset-66nzb 1/1 Running 0 40m
nvidia-sandbox-validator-v76vz 1/1 Running 0 40m
nvidia-vfio-manager-lxqzb 1/1 Running 0 41m
nvidia-vgpu-device-manager-44sxj 1/1 Running 0 14m
nvidia-vgpu-device-manager-qpdlx 1/1 Running 0 14m
nvidia-vgpu-manager-daemonset-411.86.202303060052-0-k52dq 2/2 Running 0 14m
nvidia-vgpu-manager-daemonset-411.86.202303060052-0-pfxgp 2/2 Running 0 14m
oc logs -f nvidia-vgpu-manager-daemonset-411.86.202303060052-0-k52dq
Defaulted container "nvidia-vgpu-manager-ctr" out of: nvidia-vgpu-manager-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
+ [[ '' == \t\r\u\e ]]
+ [[ ! -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]]
+ cp -r /usr/local/bin/ocp_dtk_entrypoint /usr/local/bin/nvidia-driver /driver /mnt/shared-nvidia-driver-toolkit/
+ env
+ sed 's/=/="/'
+ sed 's/$/"/'
+ touch /mnt/shared-nvidia-driver-toolkit/dir_prepared
+ set +x
Tue Oct 3 19:35:25 UTC 2023 Waiting for openshift-driver-toolkit-ctr container to start ...
Tue Oct 3 19:35:40 UTC 2023 openshift-driver-toolkit-ctr started.
+ sleep infinity
@ppetko AFAIK, we don't update MachineConfig at all from our code. What is the actual change that is being applied through MachineConfig? Maybe some other operator (OSV?) triggered that?
From what I can see, as soon as we applied the correct ClusterPolicy CR, two new machine configs were created. But the configurations don't look related to the GPUs, so I'm not sure what caused this.
oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
00-worker 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-master-container-runtime 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-master-kubelet 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-worker-container-runtime 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-worker-kubelet 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
100-worker-iommu 3.2.0 194d
100-worker-vfiopci 3.2.0 194d
50-masters-chrony-configuration 3.1.0 195d
50-workers-chrony-configuration 3.1.0 195d
99-assisted-installer-master-ssh 3.1.0 195d
99-master-generated-crio-add-inheritable-capabilities 3.2.0 195d
99-master-generated-registries 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
99-master-ssh 3.2.0 195d
99-worker-generated-crio-add-inheritable-capabilities 3.2.0 195d
99-worker-generated-registries 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
99-worker-ssh 3.2.0 195d
rendered-master-4601510310247f17c4b2ee3ada9ca54f 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
rendered-master-bd4920bc82fa2273f8e79e3c851cba39 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 19h
rendered-worker-06a98033c5f02d42ff75208c7b1db70c 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 19h
rendered-worker-20a3cea1f4b3d262015faf2610a652e1 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
rendered-worker-a915ba541d8df2a6741b2f8507ea3928 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 194d
rendered-worker-d242c91395c7a350afeeaab80b133966 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 194d
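One way to see what actually changed, as a sketch using the names from the listing above: diff the currently applied rendered worker config against the newly rendered one.
diff <(oc get mc rendered-worker-d242c91395c7a350afeeaab80b133966 -o yaml) \
     <(oc get mc rendered-worker-06a98033c5f02d42ff75208c7b1db70c -o yaml)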
Now the worker machine config pool is in degraded state.
oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-bd4920bc82fa2273f8e79e3c851cba39 True False False 3 3 3 0 195d
worker rendered-worker-d242c91395c7a350afeeaab80b133966 False True True 5 3 3 1 195d
I will create a smaller cluster with GPU nodes only and then I will attempt the installation again. Thank you.
@fabiendupont any idea why the machineconfig got updated in this case?
I don't see an obvious reason. It could be that the MachineConfigPool node selector uses labels created by the NVIDIA GPU Operator.
@ppetko, can you describe the MachineConfigPool?
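For reference, a sketch of the commands that would show why the pool is degraded (the node name is a placeholder):
oc describe mcp worker
oc describe node <stuck-worker-node> | grep machineconfiguration.openshift.io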