nvidia-docker
GPU becomes unavailable after some time in Kubernetes environment
1. Issue or feature description
We have a problem where GPUs become unavailable in a Kubernetes pod. Some time after the pod is created, running the `nvidia-smi` command in the pod fails with the error message `Failed to initialize NVML: Unknown Error`.
2. Steps to reproduce the issue
E.g. create a Kubernetes pod on an Ubuntu 20.04 system with the NVIDIA driver installed and run `watch -n 1 nvidia-smi` inside the pod (the failure may take anywhere from a few minutes to several hours to appear).
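A rough reproduction sketch, assuming a cluster where the NVIDIA device plugin is already deployed; the pod name `gpu-test` and the CUDA base image tag are examples, not taken from the report:

```sh
# Create a pod that requests one GPU and just sleeps.
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource;
# pod name and image tag are illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.3-base-ubuntu20.04   # pick a tag matching your driver
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Watch nvidia-smi inside the pod until it starts failing with
# "Failed to initialize NVML: Unknown Error".
kubectl exec -it gpu-test -- watch -n 1 nvidia-smi
```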
3. Information to attach (optional if deemed irrelevant)
- [ ] Some nvidia-container information: `nvidia-container-cli -k -d /dev/tty info`
- [ ] Kernel version from `uname -a`
- [ ] Any relevant kernel output lines from `dmesg`
- [ ] Driver information from `nvidia-smi -a`
- [ ] Docker version from `docker version`
- [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- [ ] NVIDIA container library version from `nvidia-container-cli -V`
- [ ] NVIDIA container library logs (see troubleshooting)
- [ ] Docker command, image and tag used
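If it helps, a sketch of collecting the information above from the affected node and pod; the pod name `gpu-test` and the output file names are examples:

```sh
# Run on the Kubernetes node hosting the failing pod.
nvidia-container-cli -k -d /dev/tty info > nvidia-container-info.txt 2>&1
uname -a > kernel-version.txt
dmesg | tail -n 200 > dmesg-tail.txt
nvidia-smi -a > nvidia-smi-a.txt
docker version > docker-version.txt
dpkg -l '*nvidia*' > nvidia-packages.txt   # or: rpm -qa '*nvidia*'
nvidia-container-cli -V > nvidia-container-cli-version.txt

# Capture the failing nvidia-smi output from inside the pod as well.
kubectl exec gpu-test -- nvidia-smi > pod-nvidia-smi.txt 2>&1
```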