ppetko comments

Results: 9 comments of ppetko

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

Hi @shivamerla, it looks like it failed. ``` oc logs -f nvidia-vgpu-device-manager-69wm6 Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init) W0928 14:49:52.314862 1 client_config.go:617] Neither --kubeconfig nor --master was...

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

@cdesiniotis there is no such pod ``` oc get pods NAME READY STATUS RESTARTS AGE gpu-operator-fbb6ffcc8-gzddt 1/1 Running 0 6d23h nvidia-sandbox-device-plugin-daemonset-s5v5b 1/1 Running 0 4d23h nvidia-sandbox-validator-9tmn8 1/1 Running 0 4d23h...

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

According to the docs, the vGPU manager should be deployed by the NVIDIA operator. In the `ClusterPolicy` CR I build a container image for the vGPU manager. ``` oc describe...

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

``` oc get ds -n nvidia-gpu-operator NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 90s nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true...

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

Hm interesting - this yaml was generated by the clusterpolicy install using the UI. Look at the logs below... Let me redeploy with the correct yaml file. ``` {"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to...

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

It would be nice if the docs gave a heads-up that once you deploy the clusterpolicy, the operator will roll the cluster and restart each node. I see 2 new...

error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

From what I can see, as soon as we applied the correct `ClusterPolicy` CR, two new machine configs were created. But the configurations don't look related to the GPUs. So...

Disconnected Operator Install

I can help out on this issue as well.

Install CaC on disconnected cluster

Any updates on this? Is this related to the compliance operator?