gpu-operator
error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Red Hat Enterprise Linux CoreOS release 4.11
- Kernel Version: Linux 4.18.0-372.46.1.el8_6.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.11
- GPU Operator Version: 23.6.1 provided by NVIDIA Corporation
2. Issue or feature description
We can't configure vGPUs with the NVIDIA GPU Operator following the docs here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/openshift-virtualization.html
3. Steps to reproduce the issue
- Install the NVIDIA GPU Operator and create a ClusterPolicy with the following vGPU parameters:
sandboxWorkloads.enabled=true
vgpuManager.enabled=true
vgpuManager.repository=<path to private repository>
vgpuManager.image=vgpu-manager
vgpuManager.version=<driver version>
vgpuManager.imagePullSecrets={<name of image pull secret>}
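For reference, the image pull secret referenced by vgpuManager.imagePullSecrets has to exist in the operator namespace before the vGPU Manager daemonset can pull the private image. A minimal sketch (the secret name, registry server, and credentials here are placeholders):
oc create secret docker-registry vgpu-manager-pull-secret \
  --docker-server=default-route-openshift-image-registry.apps.ocp4.poc.site \
  --docker-username=<user> \
  --docker-password=<password-or-token> \
  -n nvidia-gpu-operator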
This is our cluster policy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
4. Debug info
4.1 When we specify the label nvidia.com/vgpu.config=A100-1-5C on each node
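For reference, the label was set per node with something like the following (a sketch; gpu4 stands in for the actual node names):
oc label node gpu4 nvidia.com/vgpu.config=A100-1-5C --overwrite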
oc logs -f nvidia-vgpu-device-manager-69wm6
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0928 14:49:52.314862 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-09-28T14:49:52Z" level=info msg="Updating to vGPU config: A100-1-5CME"
time="2023-09-28T14:49:52Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-28T14:49:52Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-28T14:49:52Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-28T14:49:52Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-28T14:49:52Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-28T14:49:52Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-28T14:49:52Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-28T14:50:23Z" level=debug msg="Parsing config file..."
time="2023-09-28T14:50:23Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-28T14:50:23Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-28T14:50:23Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-28T14:50:23Z" level=error msg="ERROR: unable to apply config 'A100-1-5CME': exit status 1"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
4.2 When we don't specify any GPU labels and let the NVIDIA operator handle the selection
oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
oc logs -f nvidia-vgpu-device-manager-q8khn -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
@ppetko can you check the logs of the vgpu-manager pod to make sure it installed successfully?
Hi @shivamerla,
It looks like it failed.
oc logs -f nvidia-vgpu-device-manager-69wm6
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0928 14:49:52.314862 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-09-28T14:49:52Z" level=info msg="Updating to vGPU config: A100-1-5CME"
time="2023-09-28T14:49:52Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-28T14:49:52Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-28T14:49:52Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-28T14:49:52Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-28T14:49:52Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-28T14:49:52Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-28T14:49:52Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-28T14:50:23Z" level=debug msg="Parsing config file..."
time="2023-09-28T14:50:23Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-28T14:50:23Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg=" GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-28T14:50:23Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-28T14:50:23Z" level=error msg="ERROR: unable to apply config 'A100-1-5CME': exit status 1"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
^C
@ppetko can you get logs from the vgpu-manager pod, not the vgpu-device-manager?
@cdesiniotis there is no such pod
oc get pods
NAME READY STATUS RESTARTS AGE
gpu-operator-fbb6ffcc8-gzddt 1/1 Running 0 6d23h
nvidia-sandbox-device-plugin-daemonset-s5v5b 1/1 Running 0 4d23h
nvidia-sandbox-validator-9tmn8 1/1 Running 0 4d23h
nvidia-vfio-manager-5j6wq 1/1 Running 0 4d23h
nvidia-vgpu-device-manager-69wm6 1/1 Running 0 4d23h
nvidia-vgpu-device-manager-w82ds 1/1 Running 0 4d23h
This is the cluster policy I'm using
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
Is the vGPU Manager already installed on the host (e.g. does running nvidia-smi on the host return anything)?
Can you also describe your GPU nodes? In particular I am interested in the value of the node label nvidia.com/gpu.deploy.vgpu-manager.
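For example, a sketch of how to run that check from the cluster (gpu4 is a placeholder node name; nvidia-smi will only be present on the host if a driver/vGPU Manager was installed directly on it):
oc debug node/gpu4 -- chroot /host nvidia-smi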
According to the docs, the vGPU Manager should be deployed by the NVIDIA operator. In the ClusterPolicy CR I reference a container image I built for the vGPU Manager.
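For context, the image referenced above was built following the vGPU Manager container section of the linked docs. Roughly (a sketch from memory; the repo path, build arg, and .run file name are assumptions, so follow the docs for the exact procedure):
# Requires the vGPU Manager .run package downloaded from the NVIDIA Licensing Portal.
git clone https://gitlab.com/nvidia/container-images/driver
cd driver/vgpu-manager/rhel8
cp ~/NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm.run .
podman build --build-arg DRIVER_VERSION=535.104.06 \
  -t default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing/vgpu-manager:535.104.06-rhcos4.11 .
podman push default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing/vgpu-manager:535.104.06-rhcos4.11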
oc describe node gpu4 | grep vgpu-manager
nvidia.com/gpu.deploy.vgpu-manager=true
These are all of the nvidia labels
oc describe node gpu4 | grep nvidia.com
nvidia.com/gpu.deploy.cc-manager=true
nvidia.com/gpu.deploy.nvsm=
nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change
nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change
nvidia.com/gpu.deploy.vgpu-device-manager=true
nvidia.com/gpu.deploy.vgpu-manager=true
nvidia.com/gpu.present=true
nvidia.com/gpu.workload.config=vm-vgpu
nvidia.com/mig.config=all-disabled
nvidia.com/mig.config.state=success
nvidia.com/vgpu.config=A100-2-10C
**nvidia.com/vgpu.config.state=failed**
nvidia.com/A100: 0
nvidia.com/gpu: 0
nvidia.com/A100: 0
nvidia.com/gpu: 0
nvidia.com/A100 1 1
nvidia.com/gpu 0 0
Can you run oc get ds -n nvidia-gpu-operator and describe the vgpu-manager daemonset?
oc get ds -n nvidia-gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 90s
nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true 90s
nvidia-dcgm 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm=true 90s
nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 90s
nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 90s
nvidia-driver-daemonset-411.86.202303060052-0 0 0 0 0 0 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202303060052-0,nvidia.com/gpu.deploy.driver=true 90s
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 90s
nvidia-node-status-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.node-status-exporter=true 90s
nvidia-operator-validator 0 0 0 0 0 nvidia.com/gpu.deploy.operator-validator=true 90s
nvidia-sandbox-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-device-plugin=true 90s
nvidia-sandbox-validator 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-validator=true 90s
nvidia-vfio-manager 1 1 1 1 1 nvidia.com/gpu.deploy.vfio-manager=true 90s
nvidia-vgpu-device-manager 2 2 2 2 2 nvidia.com/gpu.deploy.vgpu-device-manager=true 90s
It looks like I don't have the daemonset for the vgpu-manager, which explains why I don't see any pods. I have specified the label nvidia.com/vgpu.config=A100-2-10C, which I'm not sure is the correct one. If I leave it blank, I'm getting the following:
oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
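For reference, clearing the label uses the trailing-dash form (a sketch; gpu4 stands in for the actual node name):
oc label node gpu4 nvidia.com/vgpu.config-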
I have opened this case, but it hasn't gotten much traction: https://forums.developer.nvidia.com/t/rror-getting-vgpu-config-error-getting-all-vgpu-devices-unable-to-read-mdev-devices-directory-open-sys-bus-mdev-devices-no-such-file-or-directory/267696
This is the output of all resources in the namespace
oc get all
NAME READY STATUS RESTARTS AGE
pod/gpu-operator-fbb6ffcc8-gzddt 1/1 Running 0 7d2h
pod/nvidia-sandbox-device-plugin-daemonset-62rbg 1/1 Running 0 6m29s
pod/nvidia-sandbox-validator-s9zsr 1/1 Running 0 6m29s
pod/nvidia-vfio-manager-wjx99 1/1 Running 0 7m5s
pod/nvidia-vgpu-device-manager-g2xsd 1/1 Running 0 7m5s
pod/nvidia-vgpu-device-manager-tzpcf 1/1 Running 0 7m5s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/gpu-operator ClusterIP 172.30.214.74 <none> 8080/TCP 7m5s
service/nvidia-dcgm-exporter ClusterIP 172.30.37.127 <none> 9400/TCP 7m5s
service/nvidia-node-status-exporter ClusterIP 172.30.62.146 <none> 8000/TCP 7m5s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 7m5s
daemonset.apps/nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true 7m5s
daemonset.apps/nvidia-dcgm 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm=true 7m5s
daemonset.apps/nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 7m5s
daemonset.apps/nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 7m5s
daemonset.apps/nvidia-driver-daemonset-411.86.202303060052-0 0 0 0 0 0 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202303060052-0,nvidia.com/gpu.deploy.driver=true 7m5s
daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 7m5s
daemonset.apps/nvidia-node-status-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.node-status-exporter=true 7m5s
daemonset.apps/nvidia-operator-validator 0 0 0 0 0 nvidia.com/gpu.deploy.operator-validator=true 7m5s
daemonset.apps/nvidia-sandbox-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-device-plugin=true 7m5s
daemonset.apps/nvidia-sandbox-validator 1 1 1 1 1 nvidia.com/gpu.deploy.sandbox-validator=true 7m5s
daemonset.apps/nvidia-vfio-manager 1 1 1 1 1 nvidia.com/gpu.deploy.vfio-manager=true 7m5s
daemonset.apps/nvidia-vgpu-device-manager 2 2 2 2 2 nvidia.com/gpu.deploy.vgpu-device-manager=true 7m5s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/gpu-operator 1/1 1 1 7d2h
NAME DESIRED CURRENT READY AGE
replicaset.apps/gpu-operator-fbb6ffcc8 1 1 1 7d2h
This doesn't seem right. If the node is labelled nvidia.com/gpu.workload.config=vm-vgpu, then we deploy both "vgpu-manager" and "vgpu-device-manager". Here we see vfio-manager getting deployed, which happens only when the workload config is vm-passthrough. If you can share the operator logs, we can check why the right operands are not getting deployed.
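For illustration, the per-node workload type is driven by that label, so switching a node between modes looks roughly like this (a sketch; gpu4 is a placeholder node name):
# vGPU mode: should result in vgpu-manager and vgpu-device-manager on the node
oc label node gpu4 nvidia.com/gpu.workload.config=vm-vgpu --overwrite
# passthrough mode: should result in vfio-manager on the node
oc label node gpu4 nvidia.com/gpu.workload.config=vm-passthrough --overwrite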
Ah, the section below is wrong:
vgpuManager:
  driverManager:
    image: vgpu-manager
    repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
    version: 535.104.06-rhcos4.11
  enabled: true
This should be:
vgpuManager:
  enabled: true
  repository: "default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing"
  image: vgpu-manager
  version: "535.104.06-rhcos4.11"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
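One hedged way to apply that correction in place rather than recreating the CR (assumes the ClusterPolicy is still named gpu-cluster-policy; the stray image/repository/version keys left under vgpuManager.driverManager may also need to be removed, e.g. with oc edit clusterpolicy gpu-cluster-policy):
oc patch clusterpolicy gpu-cluster-policy --type merge -p \
  '{"spec":{"vgpuManager":{"enabled":true,"repository":"default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing","image":"vgpu-manager","version":"535.104.06-rhcos4.11"}}}'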
Hm, interesting - this YAML was generated by the ClusterPolicy install using the UI. Look at the logs below... Let me redeploy with the correct YAML file.
{"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to apply transformation","Daemonset":"nvidia-vgpu-manager-daemonset","resource":"nvidia-vgpu-manager-daemonset","error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"info","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Could not pre-process","DaemonSet":"nvidia-vgpu-manager-daemonset","Namespace":"nvidia-gpu-operator","Error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"62d09b2d-b745-4df4-bf74-dda2fd3c7cf2","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 411.86.202303060052-0: failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to apply transformation","Daemonset":"nvidia-vgpu-manager-daemonset","resource":"nvidia-vgpu-manager-daemonset","error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"info","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Could not pre-process","DaemonSet":"nvidia-vgpu-manager-daemonset","Namespace":"nvidia-gpu-operator","Error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"62d09b2d-b745-4df4-bf74-dda2fd3c7cf2","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 411.86.202303060052-0: failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
A little heads-up in the docs would be nice that once you deploy the ClusterPolicy, the operator will roll the cluster and restart each node. I see 2 new machine configs are applied and the cluster is trying to update. The problem is that it's stuck on a node that doesn't have a GPU. I have already loaded the kernel parameters for the GPUs using a machine config only for the nodes that contain a GPU.
What exactly are the machine configs trying to configure? Are there any docs on this process?
The kernel modules are already loaded:
oc debug node/gpu1 -- chroot /host lspci -nnk -d 10de:
Starting pod/gpu1ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
0000:31:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:147e]
Kernel driver in use: nvidia
Kernel modules: nouveau
oc debug node/gpu3 -- chroot /host lspci -nnk -d 10de:
Starting pod/gpu3ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
1b:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1533]
Kernel driver in use: nvidia
Kernel modules: nouveau
1c:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1533]
Kernel driver in use: nvidia
Kernel modules: nouveau
Output of the mcp
oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-bd4920bc82fa2273f8e79e3c851cba39 True False False 3 3 3 0 194d
worker rendered-worker-d242c91395c7a350afeeaab80b133966 False True True 5 3 3 1 194d
On the bright side, I think the deployment is fixed. I checked: the ClusterPolicy UI generates a wrong ClusterPolicy in version 23.6.1 provided by NVIDIA Corporation.
oc get pods
NAME READY STATUS RESTARTS AGE
gpu-operator-fbb6ffcc8-qdd5g 1/1 Running 0 48m
nvidia-sandbox-device-plugin-daemonset-66nzb 1/1 Running 0 40m
nvidia-sandbox-validator-v76vz 1/1 Running 0 40m
nvidia-vfio-manager-lxqzb 1/1 Running 0 41m
nvidia-vgpu-device-manager-44sxj 1/1 Running 0 14m
nvidia-vgpu-device-manager-qpdlx 1/1 Running 0 14m
nvidia-vgpu-manager-daemonset-411.86.202303060052-0-k52dq 2/2 Running 0 14m
nvidia-vgpu-manager-daemonset-411.86.202303060052-0-pfxgp 2/2 Running 0 14m
oc logs -f nvidia-vgpu-manager-daemonset-411.86.202303060052-0-k52dq
Defaulted container "nvidia-vgpu-manager-ctr" out of: nvidia-vgpu-manager-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
+ [[ '' == \t\r\u\e ]]
+ [[ ! -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]]
+ cp -r /usr/local/bin/ocp_dtk_entrypoint /usr/local/bin/nvidia-driver /driver /mnt/shared-nvidia-driver-toolkit/
+ env
+ sed 's/=/="/'
+ sed 's/$/"/'
+ touch /mnt/shared-nvidia-driver-toolkit/dir_prepared
+ set +x
Tue Oct 3 19:35:25 UTC 2023 Waiting for openshift-driver-toolkit-ctr container to start ...
Tue Oct 3 19:35:40 UTC 2023 openshift-driver-toolkit-ctr started.
+ sleep infinity
@ppetko AFAIK, we don't update MachineConfig at all from our code. What is the actual change that is being applied through MachineConfig? Maybe some other operator (OSV?) triggered that?
From what I can see, as soon as we applied the correct ClusterPolicy CR, two new machine configs were created. But the configurations don't look related to the GPUs, so I'm not sure what caused this.
oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
00-worker 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-master-container-runtime 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-master-kubelet 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-worker-container-runtime 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
01-worker-kubelet 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
100-worker-iommu 3.2.0 194d
100-worker-vfiopci 3.2.0 194d
50-masters-chrony-configuration 3.1.0 195d
50-workers-chrony-configuration 3.1.0 195d
99-assisted-installer-master-ssh 3.1.0 195d
99-master-generated-crio-add-inheritable-capabilities 3.2.0 195d
99-master-generated-registries 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
99-master-ssh 3.2.0 195d
99-worker-generated-crio-add-inheritable-capabilities 3.2.0 195d
99-worker-generated-registries 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
99-worker-ssh 3.2.0 195d
rendered-master-4601510310247f17c4b2ee3ada9ca54f 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
rendered-master-bd4920bc82fa2273f8e79e3c851cba39 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 19h
rendered-worker-06a98033c5f02d42ff75208c7b1db70c 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 19h
rendered-worker-20a3cea1f4b3d262015faf2610a652e1 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 195d
rendered-worker-a915ba541d8df2a6741b2f8507ea3928 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 194d
rendered-worker-d242c91395c7a350afeeaab80b133966 624a49edf1d0eeca83d70c58faae25516fa25e20 3.2.0 194d
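One way to see what actually changed, as a sketch using the names from the listing above: diff the currently applied rendered worker config against the newly rendered one.
diff <(oc get mc rendered-worker-d242c91395c7a350afeeaab80b133966 -o yaml) \
     <(oc get mc rendered-worker-06a98033c5f02d42ff75208c7b1db70c -o yaml)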
Now the worker machine config pool is in degraded state.
oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-bd4920bc82fa2273f8e79e3c851cba39 True False False 3 3 3 0 195d
worker rendered-worker-d242c91395c7a350afeeaab80b133966 False True True 5 3 3 1 195d
I will create a smaller cluster with GPU nodes only and then I will attempt the installation again. Thank you.
@fabiendupont any idea why the machineconfig got updated in this case?
I don't see an obvious reason. It could be that the MachineConfigPool node selector uses labels created by the NVIDIA GPU Operator.
@ppetko, can you describe the MachineConfigPool?
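For reference, a sketch of the commands that would show why the pool is degraded (the node name is a placeholder):
oc describe mcp worker
oc describe node <stuck-worker-node> | grep machineconfiguration.openshift.io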