Shiva Krishna Merla
@fame346 will check with Canonical on the mismatch of these packages as they should be aligned with the driver version. Meanwhile, you can install Fabric Manager from the NVIDIA CUDA repos...
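As a rough sketch of that route, assuming an Ubuntu 22.04 node and a 535-branch driver (adjust the repo path and package branch to match your distro and installed driver):

```bash
# Add the NVIDIA CUDA network repo (Ubuntu 22.04 / x86_64 assumed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install the Fabric Manager package matching the installed driver branch (535 is just an example)
sudo apt-get install -y nvidia-fabricmanager-535

# Enable and start the Fabric Manager service
sudo systemctl enable --now nvidia-fabricmanager
```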
@arpitsharma-vw can you check `dmesg` on the node and report any driver errors: `dmesg | grep -i nvrm`. If you see GSP RM related errors, please try [this](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/custom-driver-params.html) workaround to...
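If it does turn out to be GSP related, the linked doc boils down to passing custom kernel module parameters to the driver container; a minimal sketch, assuming the default ClusterPolicy name `cluster-policy` and using `NVreg_EnableGpuFirmware=0` to disable GSP (please follow the doc for the exact steps):

```bash
# ConfigMap holding custom nvidia kernel module parameters (the doc uses a file named nvidia.conf)
cat <<EOF > nvidia.conf
NVreg_EnableGpuFirmware=0
EOF
kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf

# Point the driver container at the ConfigMap
kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec": {"driver": {"kernelModuleConfigMap": "kernel-module-params"}}}'
```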
@quanguachong we do not support this configuration currently. You can make this work by installing the container-toolkit packages manually on the node and disabling the toolkit container with the gpu-operator. This scenario...
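For reference, a rough sketch of that manual route (repo setup for the toolkit packages omitted; the Helm flag assumes a standard gpu-operator chart install):

```bash
# On each GPU node: install the container toolkit from the host package manager
sudo apt-get install -y nvidia-container-toolkit

# Configure containerd to use the nvidia runtime and restart it
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Install/upgrade gpu-operator with the toolkit container disabled
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --set toolkit.enabled=false
```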
Can you run `kubectl get pods -n gpu-operator` to show which pods are running? If you deployed with the driver enabled, it takes 3-5 minutes for the drivers to be installed...
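Something along these lines shows the progress while the driver comes up (the daemonset name assumes a default install and may differ):

```bash
# Watch operator pods until the driver and validator pods are Running
kubectl get pods -n gpu-operator -w

# Or wait on the driver daemonset rollout directly
kubectl -n gpu-operator rollout status ds/nvidia-driver-daemonset --timeout=10m
```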
You can disable the toolkit as well by running `kubectl edit clusterpolicy` and setting `toolkit.enabled=false`. It looks like you have nvidia-container-runtime already configured on the host and the containerd config updated manually?
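A non-interactive equivalent, assuming the default ClusterPolicy name `cluster-policy`:

```bash
# Disable the toolkit container without opening an editor
kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec": {"toolkit": {"enabled": false}}}'
```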
Can you also paste the logs of the `nvidia-container-toolkit-daemonset-9rvz8` pod? I'm curious why it is restarting. Which containerd and OS versions are these?
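For example (namespace assumed to be `gpu-operator`):

```bash
# Current and previous container logs from the toolkit pod
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-9rvz8
kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-9rvz8 --previous

# Restart reason and events
kubectl describe pod -n gpu-operator nvidia-container-toolkit-daemonset-9rvz8

# containerd and OS versions from the node
containerd --version
cat /etc/os-release
```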
Thanks @denissabramovs, will check these out and try to repro with containerd version 1.6.9.
Thanks @xhejtman for linking the relevant issue.
Thanks for the inputs @anoopsinghnegi, we will look into avoiding `containerd` restarts and `driver` unloads whenever not necessary. Since the driver container bind mounts the container path `/run/nvidia/driver` onto the...
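As a side note, one way to sanity-check the containerized driver from the host through that bind mount (a sketch, assuming the default `/run/nvidia/driver` root):

```bash
# The driver container mounts its root filesystem at /run/nvidia/driver on the host;
# running nvidia-smi inside that chroot confirms the driver from the container is loaded
sudo chroot /run/nvidia/driver nvidia-smi
```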
@ArangoGutierrez any thoughts?