Shiva Krishna Merla

Results 278 comments of Shiva Krishna Merla

@neggert Can you attach `/var/log/messages` or the logs from `journalctl -xb > journal.log`? This might help us understand whether containerd is reloaded correctly after the toolkit upgrade. If it got into...

@neggert Can you set the env for the toolkit with `--set toolkit.env[0].name=CONTAINERD_RESTART_MODE --set toolkit.env[0].value=none`? This will avoid containerd reloads in your case, and for upgrades we don't really need a reload as...
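
For reference, the two `--set` flags above can also be written as a Helm values fragment. This is a minimal sketch, assuming only that the chart reads `toolkit.env` as a list of name/value pairs, which is what the flags imply; the name of the file it goes into is up to you.

```yaml
# Values-file equivalent of:
#   --set toolkit.env[0].name=CONTAINERD_RESTART_MODE
#   --set toolkit.env[0].value=none
toolkit:
  env:
    - name: CONTAINERD_RESTART_MODE
      value: "none"   # quoted so it is always passed as a string
```

Passing the fragment with `-f <values-file>` should have the same effect as the two flags on the command line.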

@neggert Agreed, this is just a workaround until we figure out why `containerd` reloads are causing issues in your case. Currently we don't modify the runtime config during operator/toolkit **upgrades**,...

@dasantonym To confirm: was the wrong driver root set in the node's nvidia-container-runtime config file? Or did you see the wrong NVIDIA_DRIVER_ROOT set within the device-plugin pod?

From the error you posted, it looks like the device-plugin is being started with the wrong NVIDIA_DRIVER_ROOT=/run/nvidia/driver. This is set based on whether the file **/run/nvidia/validations/host-driver-ready** is in place. ``` [/run/nvidia/driver/dev/nvidiactl](spec: failed to generate spec:...

Can you restart the device-plugin on the failing node to confirm whether the issue persists across MIG changes? The other place where the driver root is set is **/etc/nvidia-container-runtime/config.toml**. The reason I asked for...

@neggert Thanks for the detailed report. Currently we evict/drain the node only when there are `nvidia` modules loaded and they cannot be unloaded after evicting GPU Operator operands....

@nonpolarity This looks like a CNI issue. The NFD worker pod is not able to communicate with the NFD master. GPU Operator requires certain PCI labels from NFD to deploy operands. ```...

@everflux Note that the RuntimeClass issue is not related to the particular issue reported here, as none of the components got added in the first place. But the RuntimeClass issue would have...

@hoangtnm This happens when the `NVIDIA_VISIBLE_DEVICES=all` environment variable is set in the image you are using (which is true for most CUDA images). Please refer to [this](https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit?usp=sharing) document...