
GPU becomes unavailable after some time in Kubernetes environment

Open · eason-jiang-intel opened this issue on Aug 9, 2022 · 0 comments

1. Issue or feature description

GPU becomes unavailable after some time in Kubernetes environment

We have a problem where GPUs become unavailable inside a Kubernetes pod. Some time after the pod is created, running the nvidia-smi command in it fails with a Failed to initialize NVML: Unknown Error message.

2. Steps to reproduce the issue

E.g. create a Kubernetes pod that requests a GPU on an Ubuntu 20.04 host with the NVIDIA driver installed, then run watch -n 1 nvidia-smi inside the pod; the failure may take anywhere from a few minutes to several hours to appear. A minimal reproduction sketch is shown below.
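A minimal reproduction sketch, assuming the NVIDIA device plugin is already deployed in the cluster; the pod name gpu-repro and the CUDA image tag are illustrative examples, not taken from the original report:

```sh
# Create a pod that requests one GPU (pod name and image tag are examples only).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-repro
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.7.1-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Poll nvidia-smi inside the pod until it starts failing with
# "Failed to initialize NVML: Unknown Error" (may take minutes to hours).
# If watch is not available in the image, a plain shell loop with sleep works too.
kubectl exec -it gpu-repro -- watch -n 1 nvidia-smi
```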

3. Information to attach (optional if deemed irrelevant; a collection sketch follows this checklist)

  • [ ] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] Driver information from nvidia-smi -a
  • [ ] Docker version from docker version
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)
  • [ ] Docker command, image and tag used
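A rough sketch of collecting the items above on the affected node, assuming a Debian/Ubuntu host to match the Ubuntu 20.04 setup described earlier (use rpm -qa '*nvidia*' instead on RPM-based systems):

```sh
# Gather the diagnostics requested in the checklist on the GPU node.
nvidia-container-cli -k -d /dev/tty info   # nvidia-container information
uname -a                                   # kernel version
dmesg | grep -iE 'nvidia|nvrm|nvml'        # relevant kernel output lines, if any
nvidia-smi -a                              # driver information
docker version                             # Docker version
dpkg -l '*nvidia*'                         # NVIDIA package versions (Debian/Ubuntu)
nvidia-container-cli -V                    # NVIDIA container library version
```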
