Notes on upgrading the CUDA version on the host
After upgrading the NVIDIA drivers on the host (e.g. with `apt-get upgrade`), NVIDIA tasks will fail to run due to a driver mismatch, e.g.:
```
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
```
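To see that the loaded kernel module and the newly installed driver have indeed diverged, one option is to compare the two versions directly (a quick sketch; the exact output format can differ between driver versions and distributions):

```bash
# Version of the driver module currently loaded in the kernel
cat /proc/driver/nvidia/version

# Version of the driver module installed on disk after the upgrade
modinfo nvidia | grep ^version
```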
Rebooting the machine resolves this, but if that is not a convenient option for a server, it can be done manually: stop any running Xorg instances (e.g. `sudo service gdm3 stop`) and then unload the driver with `sudo rmmod nvidia`. The latter may fail and list submodules that are still loaded, so unload these first as well, e.g. `sudo rmmod nvidia-uvm`. Then run `sudo nvidia-smi`, which reloads the NVIDIA drivers, to confirm the GPU is back up and running. The full sequence is sketched below.
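Putting the steps together, a minimal sketch of the sequence (assuming gdm3 is the display manager in use; `lsmod` is only there to show which NVIDIA submodules are still loaded):

```bash
# Stop the display manager so Xorg releases the GPU (assumes gdm3)
sudo service gdm3 stop

# See which nvidia modules are currently loaded
lsmod | grep nvidia

# Unload dependent submodules first, then the main driver module
# (other submodules such as nvidia_drm or nvidia_modeset may also need unloading)
sudo rmmod nvidia-uvm
sudo rmmod nvidia

# Running nvidia-smi reloads the freshly installed driver
sudo nvidia-smi
```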
see:
- https://forums.developer.nvidia.com/t/reset-driver-without-rebooting-on-linux/40625/2
- https://forums.developer.nvidia.com/t/cant-install-new-driver-cannot-unload-module/63639
Running nvidia-docker instances, e.g. with `docker run --gpus all ...`, should now work again as before. We should add this to the user docs when we get to writing down more stuff about CUDA images...
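As a quick check that the container runtime can see the GPU again, something like the following should print the usual `nvidia-smi` table from inside a container (the image tag here is just an illustration; any CUDA base image compatible with the installed driver will do):

```bash
# Run nvidia-smi inside a throwaway CUDA container to verify GPU access
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```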