
GPU becomes unavailable after some time in Kubernetes environment

Open · eason-jiang-intel opened this issue on Aug 9, 2022 · 0 comments

1. Issue or feature description

GPU becomes unavailable after some time in Kubernetes environment

We have a problem where GPUs become unavailable inside a Kubernetes pod. Some time after the pod is created, running the nvidia-smi command in it fails with a Failed to initialize NVML: Unknown Error message.

2. Steps to reproduce the issue

E.g. create a Kubernetes pod that requests a GPU on an Ubuntu 20.04 host with the NVIDIA driver installed, then run watch -n 1 nvidia-smi inside the pod; the failure may take anywhere from a few minutes to several hours to appear. A minimal reproduction sketch is shown below.
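A minimal reproduction sketch, assuming the NVIDIA device plugin is already deployed in the cluster; the pod name gpu-repro and the CUDA image tag are illustrative examples, not taken from the original report:

```sh
# Create a pod that requests one GPU (pod name and image tag are examples only).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-repro
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.7.1-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Poll nvidia-smi inside the pod until it starts failing with
# "Failed to initialize NVML: Unknown Error" (may take minutes to hours).
# If watch is not available in the image, a plain shell loop with sleep works too.
kubectl exec -it gpu-repro -- watch -n 1 nvidia-smi
```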

3. Information to attach (optional if deemed irrelevant; a collection sketch follows this checklist)

  • [ ] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] Driver information from nvidia-smi -a
  • [ ] Docker version from docker version
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)
  • [ ] Docker command, image and tag used
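A rough sketch of collecting the items above on the affected node, assuming a Debian/Ubuntu host to match the Ubuntu 20.04 setup described earlier (use rpm -qa '*nvidia*' instead on RPM-based systems):

```sh
# Gather the diagnostics requested in the checklist on the GPU node.
nvidia-container-cli -k -d /dev/tty info   # nvidia-container information
uname -a                                   # kernel version
dmesg | grep -iE 'nvidia|nvrm|nvml'        # relevant kernel output lines, if any
nvidia-smi -a                              # driver information
docker version                             # Docker version
dpkg -l '*nvidia*'                         # NVIDIA package versions (Debian/Ubuntu)
nvidia-container-cli -V                    # NVIDIA container library version
```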
