Shiva Krishna Merla
@fame346 will check with Canonical on the mismatch of these packages as they should be aligned with the driver version. Meanwhile, you can install Fabric Manager from the NVIDIA CUDA repos...
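As a rough sketch of that route, assuming an Ubuntu 22.04 node and a 535-branch driver (adjust the repo path and package branch to match your distro and installed driver):

```bash
# Add the NVIDIA CUDA network repo (Ubuntu 22.04 / x86_64 assumed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install the Fabric Manager package matching the installed driver branch (535 is just an example)
sudo apt-get install -y nvidia-fabricmanager-535

# Enable and start the Fabric Manager service
sudo systemctl enable --now nvidia-fabricmanager
```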
@arpitsharma-vw can you check `dmesg` on the node and report any driver errors: `dmesg | grep -i nvrm`. If you see GSP RM related errors, please try [this](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/custom-driver-params.html) workaround to...
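If it does turn out to be GSP related, the linked doc boils down to passing custom kernel module parameters to the driver container; a minimal sketch, assuming the default ClusterPolicy name `cluster-policy` and using `NVreg_EnableGpuFirmware=0` to disable GSP (please follow the doc for the exact steps):

```bash
# ConfigMap holding custom nvidia kernel module parameters (the doc uses a file named nvidia.conf)
cat <<EOF > nvidia.conf
NVreg_EnableGpuFirmware=0
EOF
kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf

# Point the driver container at the ConfigMap
kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec": {"driver": {"kernelModuleConfigMap": "kernel-module-params"}}}'
```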
@quanguachong we do not support this configuration currently. You can make this work by installing the container-toolkit packages manually on the node and disabling the toolkit container with the gpu-operator. This scenario...
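For reference, a rough sketch of that manual route (repo setup for the toolkit packages omitted; the Helm flag assumes a standard gpu-operator chart install):

```bash
# On each GPU node: install the container toolkit from the host package manager
sudo apt-get install -y nvidia-container-toolkit

# Configure containerd to use the nvidia runtime and restart it
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Install/upgrade gpu-operator with the toolkit container disabled
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --set toolkit.enabled=false
```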
Can you run `kubectl get pods -n gpu-operator` to show which pods are running? If you deployed with the driver enabled, it takes 3-5 minutes for the drivers to be installed...
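Something along these lines shows the progress while the driver comes up (the daemonset name assumes a default install and may differ):

```bash
# Watch operator pods until the driver and validator pods are Running
kubectl get pods -n gpu-operator -w

# Or wait on the driver daemonset rollout directly
kubectl -n gpu-operator rollout status ds/nvidia-driver-daemonset --timeout=10m
```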
You can disable the toolkit as well by running `kubectl edit clusterpolicy` and setting `toolkit.enabled=false`. It looks like you have nvidia-container-runtime already configured on the host and the containerd config updated manually?
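A non-interactive equivalent, assuming the default ClusterPolicy name `cluster-policy`:

```bash
# Disable the toolkit container without opening an editor
kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec": {"toolkit": {"enabled": false}}}'
```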
Can you also paste the logs of the `nvidia-container-toolkit-daemonset-9rvz8` pod? I'm curious why it is restarting. Which containerd and OS versions are these?
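For example (namespace assumed to be `gpu-operator`):

```bash
# Current and previous container logs from the toolkit pod
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-9rvz8
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-9rvz8 --previous

# Restart reason and events
kubectl describe pod -n gpu-operator nvidia-container-toolkit-daemonset-9rvz8

# containerd and OS versions from the node
containerd --version
cat /etc/os-release
```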
Thanks @denissabramovs, will check these out and try to repro with containerd version 1.6.9.
Thanks @xhejtman for linking the relevant issue.
Thanks for the inputs @anoopsinghnegi, we will look into avoiding `containerd` restarts and `driver` unloads whenever not necessary. Since the driver container bind mounts the container path `/run/nvidia/driver` onto the...
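As a side note, one way to sanity-check the containerized driver from the host through that bind mount (a sketch, assuming the default `/run/nvidia/driver` root):

```bash
# The driver container mounts its root filesystem at /run/nvidia/driver on the host;
# running nvidia-smi inside that chroot confirms the driver from the container is loaded
sudo chroot /run/nvidia/driver nvidia-smi
```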
@ArangoGutierrez any thoughts?