nvidia-smi command in container returns "Failed to initialize NVML: Unknown Error" after couple of times

Open yuan6711043 opened this issue 3 years ago • 6 comments

1. Issue or feature description

The NVIDIA GPU works well right after the container starts, but after the container has been running for a while (possibly several days), the GPUs mounted by the NVIDIA container runtime become invalid. Running nvidia-smi in the container returns "Failed to initialize NVML: Unknown Error", while it works well on the host machine.

nvidia-smi looks fine on the host, and we can see the training process information in the host's nvidia-smi output. If we stop the training process now, it can no longer be restarted.

Following the solution from issue https://github.com/NVIDIA/nvidia-docker/issues/1618, we tried upgrading to cgroup v2, but it did not help.

Surprisingly, we cannot find any devices.list file in the container, which is mentioned in https://github.com/NVIDIA/nvidia-docker/issues/1618
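
One possible reason for the missing file, offered as an assumption rather than a confirmed diagnosis: devices.list belongs to the cgroup v1 devices controller, and the unified cgroup v2 hierarchy has no such file. A minimal Python sketch for checking which layout a container sees is below. The function names are hypothetical, and the v1 path /sys/fs/cgroup/devices/devices.list is the conventional location, which may differ on a given system.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2", "v1", or "unknown" for the cgroup tree mounted at root."""
    base = Path(root)
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # A v1 layout mounts each controller in its own directory, e.g. devices/.
    if (base / "devices").is_dir():
        return "v1"
    return "unknown"


def has_devices_list(root: str = "/sys/fs/cgroup") -> bool:
    """True if the v1 devices controller file is visible under root."""
    # devices.list is a cgroup v1 file; cgroup v2 has no equivalent file.
    return (Path(root) / "devices" / "devices.list").is_file()
```

Calling cgroup_version() and has_devices_list() inside the container would show whether the absence of devices.list is simply a consequence of running on cgroup v2.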

2. Steps to reproduce the issue

We found that this issue can be reproduced by running "systemctl daemon-reload" on the host, but we have not actually run any similar commands in our production environment.

Can anyone suggest how to track down this problem?

3. Information to attach (optional if deemed irrelevant)

docker: 20.10.7

k8s: v1.22.5

nvidia driver version: 470.103.01

nvidia-container-runtime: 3.8.1-1

containerd: 1.5.5-0ubuntu3~20.04.2

yuan6711043 avatar Sep 14 '22 04:09 yuan6711043

I've noticed the same behavior for some time on Debian 11; at least since March as that is when I started regularly checking for nvidia-smi functioning in containers, and thanks for calling out systemctl daemon-reload as something that triggers it. In my case, I have automatic updates enabled in Debian using unattended upgrades and your mention of daemon-reload makes me think that the package updates may be triggering a daemon-reload event to occur. I'm only updating packages from the Debian repos automatically, applying nvidia-docker and 3rd party repo updates manually.
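
A periodic check of this kind could look roughly like the following Python sketch. It is not the script used here; the function names are hypothetical, and it assumes only that nvidia-smi is on PATH inside the container and that a broken GPU shows up as the "Failed to initialize NVML" message or a non-zero exit code.

```python
import subprocess

# Error text reported in this issue when the container loses GPU access.
NVML_ERROR = "Failed to initialize NVML"


def gpu_check_failed(returncode: int, output: str) -> bool:
    """Classify one nvidia-smi run from its exit code and combined output."""
    return returncode != 0 or NVML_ERROR in output


def run_check(cmd=("nvidia-smi",)) -> bool:
    """Run the command and return True if the GPU still looks usable."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hang is treated as a failed check.
        return False
    return not gpu_check_failed(proc.returncode, proc.stdout + proc.stderr)
```

Running run_check() from cron or a container health check and sending a notification when it returns False gives a timestamp that can later be compared with package upgrade times.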

Here is an example from today where I can see the auto updates happening. In this case, telegraf is being updated and then a daemon-reload occurs, or at least that is what I believe I am seeing from "systemd[1]: Reloading.", based on the output when I manually run a systemctl daemon-reload:

Sep 14 06:46:17 athena systemd[1]: Starting Daily apt upgrade and clean activities...
Sep 14 06:46:50 athena systemd[1]: Stopping The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 14 06:46:52 athena telegraf[6829]: 2022-09-14T10:46:52Z I! [agent] Hang on, flushing any cached metrics before shutdown
Sep 14 06:46:52 athena telegraf[6829]: 2022-09-14T10:46:52Z I! [agent] Stopping running outputs
Sep 14 06:46:52 athena systemd[1]: telegraf.service: Succeeded.
Sep 14 06:46:52 athena systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 14 06:46:52 athena systemd[1]: telegraf.service: Consumed 14h 46min 50.974s CPU time.
Sep 14 06:46:54 athena systemd[1]: Reloading.
Sep 14 06:46:54 athena systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Sep 14 06:46:54 athena systemd[1]: nut-monitor.service: Supervising process 7162 which is not our child. We'll most likely not notice when it exits.
Sep 14 06:46:54 athena systemd[1]: Reloading.
Sep 14 06:46:55 athena systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Sep 14 06:46:55 athena systemd[1]: nut-monitor.service: Supervising process 7162 which is not our child. We'll most likely not notice when it exits.
Sep 14 06:46:55 athena systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z W! DeprecationWarning: Option "perdevice" of plugin "inputs.docker" deprecated since version 1.18.0 and will be removed in 2.0.0: use 'perdevice_include' instead
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Starting Telegraf 1.24.0
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded inputs: cpu disk diskio docker exec file ipmi_sensor kernel mem net netstat nvidia_smi processes smart swap system zfs
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded aggregators:
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded processors:
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded outputs: influxdb_v2
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Tags enabled: host=athena
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z W! Deprecated inputs: 0 and 1 options
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"athena", Flush Interval:10s
Sep 14 06:46:55 athena systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 14 06:47:14 athena systemd[1]: apt-daily-upgrade.service: Succeeded.
Sep 14 06:47:14 athena systemd[1]: Finished Daily apt upgrade and clean activities.
Sep 14 06:47:14 athena systemd[1]: apt-daily-upgrade.service: Consumed 58.400s CPU time.
Sep 14 06:47:29 athena systemd[1]: Starting Cleanup of Temporary Directories...
Sep 14 06:47:29 athena systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Sep 14 06:47:29 athena systemd[1]: Finished Cleanup of Temporary Directories.
Sep 14 06:50:00 athena systemd[1]: Starting system activity accounting tool...
Sep 14 06:50:00 athena systemd[1]: sysstat-collect.service: Succeeded.
Sep 14 06:50:00 athena systemd[1]: Finished system activity accounting tool.
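
To pull such events out of a longer log, a small Python sketch like the one below may help. It assumes the syslog-style line format shown above and treats "systemd[1]: Reloading." as the marker of a daemon-reload, which is the interpretation suggested in this comment rather than something verified. The function name is hypothetical.

```python
import re

# Matches lines such as: "Sep 14 06:46:54 athena systemd[1]: Reloading."
RELOAD = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\ssystemd\[1\]: Reloading\.\s*$"
)


def reload_timestamps(lines):
    """Return the timestamp of every systemd reload line, in input order."""
    stamps = []
    for line in lines:
        match = RELOAD.match(line)
        if match:
            stamps.append(match.group("ts"))
    return stamps
```

Feeding it the saved output of journalctl or the lines of /var/log/syslog yields a list of reload times to compare with the moments the GPU check started failing.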

mbentley avatar Sep 14 '22 14:09 mbentley

Do the packages in the debian repositories include the NVIDIA Drivers?

elezar avatar Sep 14 '22 14:09 elezar

Fair point and callout - they do. Looking back, I cross-checked the times I detected the issue (and sent myself a notification) against the packages that were upgraded at the time, and recorded the following:

2022-09-14:

google-chrome-stable:amd64
telegraf:amd64
handbrake-cli:amd64
handbrake:amd64
handbrake-gtk:amd64

2022-08-17:

epiphany-browser-data:amd64
libjavascriptcoregtk-4.0-18:amd64
libsnmp40:amd64
libsnmp-base:amd64
telegraf:amd64
google-chrome-stable:amd64
epiphany-browser:amd64

2022-08-13:

python3-samba:amd64
libldb2:amd64
samba-vfs-modules:amd64
samba:amd64
libwbclient0:amd64
libsmbclient:amd64
samba-dsdb-modules:amd64
samba-common-bin:amd64
python3-ldb:amd64
samba-libs:amd64
samba-common:amd64

2022-07-27:

linux-kbuild-5.10:amd64
linux-compiler-gcc-10-x86:amd64
telegraf:amd64
linux-libc-dev:amd64
libcpupower1:amd64

2022-07-13:

telegraf:amd64

2022-07-06:

google-chrome-stable:amd64
telegraf:amd64

2022-05-29:

rsyslog:amd64

2022-05-20:

libldap-common:amd64
ldap-utils:amd64
libldap-2.4-2:amd64
libldap-2.4-2:i386

2022-05-17:

telegraf:amd64

2022-04-29:

telegraf:amd64

2022-04-27:

telegraf:amd64

2022-04-20:

libnvpair3linux:amd64
libuutil3linux:amd64
zfs-dkms:amd64
libzpool5linux:amd64
libzfs4linux:amd64
zfsutils-linux:amd64

While I see telegraf frequently, it's not consistent. I may just be reading too much into it based on the daemon-reload behavior, but in almost every case I can see that a package with a systemd unit was upgraded, which I would expect triggers a daemon-reload to deal with the update. Unfortunately I do not have syslog logs from that far back to confirm that in all cases, but I can see that it doesn't seem to correspond to driver package updates.
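
The cross-check above can be tallied with a short Python sketch. The data is copied from the list in this comment with the architecture suffix (such as :amd64) dropped; the variable and function names are hypothetical.

```python
from collections import Counter

# Packages upgraded on each date the issue was detected (from the list above).
upgrades = {
    "2022-09-14": ["google-chrome-stable", "telegraf", "handbrake-cli",
                   "handbrake", "handbrake-gtk"],
    "2022-08-17": ["epiphany-browser-data", "libjavascriptcoregtk-4.0-18",
                   "libsnmp40", "libsnmp-base", "telegraf",
                   "google-chrome-stable", "epiphany-browser"],
    "2022-08-13": ["python3-samba", "libldb2", "samba-vfs-modules", "samba",
                   "libwbclient0", "libsmbclient", "samba-dsdb-modules",
                   "samba-common-bin", "python3-ldb", "samba-libs",
                   "samba-common"],
    "2022-07-27": ["linux-kbuild-5.10", "linux-compiler-gcc-10-x86",
                   "telegraf", "linux-libc-dev", "libcpupower1"],
    "2022-07-13": ["telegraf"],
    "2022-07-06": ["google-chrome-stable", "telegraf"],
    "2022-05-29": ["rsyslog"],
    "2022-05-20": ["libldap-common", "ldap-utils", "libldap-2.4-2",
                   "libldap-2.4-2"],
    "2022-05-17": ["telegraf"],
    "2022-04-29": ["telegraf"],
    "2022-04-27": ["telegraf"],
    "2022-04-20": ["libnvpair3linux", "libuutil3linux", "zfs-dkms",
                   "libzpool5linux", "libzfs4linux", "zfsutils-linux"],
}


def package_frequency(by_date):
    """Count on how many detection dates each package was upgraded."""
    counts = Counter()
    for packages in by_date.values():
        # A set avoids counting a package twice on one date (e.g. two arches).
        counts.update(set(packages))
    return counts
```

With this data, package_frequency(upgrades)["telegraf"] is 8 out of 12 dates, which matches the impression that telegraf is frequent but not the only package involved.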

mbentley avatar Sep 14 '22 15:09 mbentley

I'm encountering the same issue. I'm currently testing some solutions proposed in NVIDIA/nvidia-container-toolkit#251 and NVIDIA/nvidia-docker#1671 and I will let you know if something works for me.

iFede94 avatar Sep 15 '22 10:09 iFede94

@mbentley thanks for the reminder. I will check whether any packages are upgraded automatically in our production environments.

yuan6711043 avatar Sep 15 '22 11:09 yuan6711043

At least in my case where telegraf is a big culprit, I can see that in the post install script it does call a systemctl daemon-reload which matches the behavior I've been seeing.

Same for rsyslog (not sure where the Debian packaging is source code wise but here is the postinst script).
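
To find other installed packages whose post-install script mentions daemon-reload, something like the following Python sketch could be used. It assumes a Debian-style system where dpkg keeps maintainer scripts as /var/lib/dpkg/info/<package>.postinst, and it does a plain text search, so it may also match scripts that only mention the string in a comment or reach it through a helper. The function name is hypothetical.

```python
from pathlib import Path


def packages_calling_daemon_reload(info_dir: str = "/var/lib/dpkg/info"):
    """List packages whose postinst script contains the text 'daemon-reload'."""
    hits = []
    for script in sorted(Path(info_dir).glob("*.postinst")):
        try:
            text = script.read_text(errors="replace")
        except OSError:
            # Skip scripts that cannot be read instead of aborting the scan.
            continue
        if "daemon-reload" in text:
            # File names look like "telegraf.postinst" or "pkg:amd64.postinst".
            hits.append(script.name[: -len(".postinst")])
    return hits
```

The resulting names are candidates for the unattended-upgrades package blacklist, or at least packages worth watching when GPUs drop out of running containers.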

So far, I haven't seen any instances where driver upgrades have impacted running containers but I've only seen one instance where the drivers were updated on 9/12 so there is only a sample size of one to go on from my logs. It would be easy enough to add the nvidia-drivers to the package blacklist if it was causing an issue but at least from the best of what I can tell, that does not seem to be the trigger.

mbentley avatar Sep 20 '22 11:09 mbentley

We recently had to solve this for a runc interactive issue, e.g.:

  • https://github.com/opencontainers/runc/issues/3551#issuecomment-1211427622
  • https://github.com/k3s-io/k3s/issues/6064

We only just realised we're hitting this now for GPUs dropping out in containers too.

dcarrion87 avatar Mar 21 '23 02:03 dcarrion87

I am closing this as a duplicate of https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 -- a known issue with certain runc / systemd version combinations. Please see the steps to address this there or create a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if you are still having problems.

elezar avatar Nov 27 '23 11:11 elezar