nvidia-docker
Upgrade NVIDIA driver without the need to restart the Docker daemon
1. Issue or feature description
We need to upgrade the NVIDIA driver on the host, and have containers pick it up, without restarting the Docker daemon.
This applies to containers consuming GPU capabilities through the NVIDIA Docker runtime.
After the host NVIDIA driver is updated (with every container using the GPU stopped during the update), attempting to start the containers again fails with the following error message:
stderr: nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/307038625cd555791f9de4ea47596a7a5815ca21a7e9b6a783368637a2fb24cd/merged/proc/driver/nvidia/params/version/registry: no such file or directory: unknown"
If we restart the whole Docker daemon, the container comes back online properly.
2. Steps to reproduce the issue
First, make sure you are running an older driver version and have at least one Docker container consuming the GPU (e.g. run nvidia-smi from within a container).
Requirements:
- Docker
- NVIDIA Docker installed
- NVIDIA driver installed
- Working nvidia-smi command
docker run --name=nvidia-test --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nginx nvidia-smi
nvidia-smi should print GPU information, marking the test as passed.
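Before reproducing, it may help to confirm the requirements with something along these lines (a rough sketch; the nginx image is just reused from the test command above):

```sh
# Rough prerequisite check before reproducing (adjust to your environment)
nvidia-smi                      # host driver loaded and working
docker info | grep -i runtime   # the "nvidia" runtime should be listed
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nginx nvidia-smi
```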
Now let's reproduce the behavior in question:
# stop the container
docker stop nvidia-test
# unload kernel module
modprobe -r nvidia_drm
# download and install any newer driver release
./Driver.run
# then start the container again
docker start nvidia-test
You should get an error message similar to:
stderr: nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/307038625cd555791f9de4ea47596a7a5815ca21a7e9b6a783368637a2fb24cd/merged/proc/driver/nvidia/params/version/registry: no such file or directory: unknown"
If you restart the Docker daemon:
systemctl restart docker
then the container can be brought back online:
docker start nvidia-test
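To double-check that the recovered container actually works, one option (given that the test container's command is nvidia-smi) is:

```sh
# Attach on start to see the container's nvidia-smi output directly
docker start -a nvidia-test
# or, after a detached start, inspect its output
docker logs --tail 30 nvidia-test
```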
Information to attach (optional if deemed irrelevant)
- [ ] Some nvidia-container information:
NVRM version: 510.68.02
CUDA version: 11.6
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce GTX 1650
Brand: GeForce
GPU UUID: GPU-7878ba12-9b30-8f49-3da8-7930824af120
Bus Location: 00000000:82:00.0
- [ ] Kernel Version
Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- [ ] Driver information from
nvidia-smi -a
$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed May 4 10:57:14 2022
Driver Version : 510.68.02
CUDA Version : 11.6
Attached GPUs : 1
GPU 00000000:82:00.0
Product Name : NVIDIA GeForce GTX 1650
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-7878ba12-9b30-8f49-3da8-7930824af120
Minor Number : 0
VBIOS Version : 90.17.3D.00.4E
MultiGPU Board : No
Board ID : 0x8200
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x82
Device : 0x00
Domain : 0x0000
Device Id : 0x1F8210DE
Bus Id : 00000000:82:00.0
Sub System Id : 0x8D921462
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 40 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 4096 MiB
Reserved : 184 MiB
Used : 0 MiB
Free : 3911 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 49 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 94 C
GPU Max Operating Temp : 92 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 18.87 W
Power Limit : 75.00 W
Default Power Limit : 75.00 W
Enforced Power Limit : 75.00 W
Min Power Limit : 45.00 W
Max Power Limit : 75.00 W
Clocks
Graphics : 1485 MHz
SM : 1485 MHz
Memory : 4001 MHz
Video : 1380 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2130 MHz
SM : 2130 MHz
Memory : 4001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
- [ ] Docker version from
docker version
docker version
Client: Docker Engine - Community
Version: 20.10.2
API version: 1.41
Go version: go1.13.15
Git commit: 2291f61
Built: Mon Dec 28 16:17:48 2020
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.2
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 8891c58
Built: Mon Dec 28 16:16:13 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- [ ] NVIDIA container library version from
nvidia-container-cli -V
nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
@remoteweb when a create command is intercepted, the NVIDIA Container Library performs mount operations in the container's namespace. These include tmpfs mounts over the following three files:
/proc/driver/nvidia/params
/proc/driver/nvidia/version
/proc/driver/nvidia/registry
The error you are seeing seems to indicate that the /proc/driver/nvidia folder does not exist on the host. Can you confirm that it does exist?
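For reference, one way to confirm this on the host, and to see the corresponding mounts from inside a running GPU container, might be (the container name is a placeholder):

```sh
# On the host: the three files the library mounts over should exist
ls -l /proc/driver/nvidia/params /proc/driver/nvidia/version /proc/driver/nvidia/registry

# Inside a running GPU container: the mounts set up by the library
docker exec <container> grep /proc/driver/nvidia /proc/mounts
```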
Note that there is a fix released in NVIDIA Container Toolkit 1.6.0 that addresses the wording of the mount error that you are seeing.
To be more specific, we updated from driver version 440 to 510.
With the 440 driver, /proc/driver/nvidia looks like this:
~ $ ls -al /proc/driver/nvidia
total 0
dr-xr-xr-x. 5 root root 0 Oct 2 2021 .
dr-xr-xr-x. 6 root root 0 Oct 2 2021 ..
dr-xr-xr-x. 3 root root 0 Oct 2 2021 gpus
-r--r--r--. 1 root root 0 May 18 08:13 params
dr-xr-xr-x. 2 root root 0 May 18 08:13 patches
-rw-r--r--. 1 root root 0 May 18 08:13 registry
-rw-r--r--. 1 root root 0 May 18 08:13 suspend
-rw-r--r--. 1 root root 0 May 18 08:13 suspend_depth
-r--r--r--. 1 root root 0 May 18 08:13 version
dr-xr-xr-x. 2 root root 0 May 18 08:13 warnings
and 510 looks like this
~ # ls -la /proc/driver/nvidia
total 0
dr-xr-xr-x 6 root root 0 Dec 21 22:46 .
dr-xr-xr-x 7 root root 0 Dec 21 22:46 ..
dr-xr-xr-x 4 root root 0 May 18 08:12 capabilities
dr-xr-xr-x 3 root root 0 Dec 21 22:46 gpus
-r--r--r-- 1 root root 0 May 18 08:12 params
dr-xr-xr-x 2 root root 0 May 18 08:12 patches
-rw-r--r-- 1 root root 0 May 18 08:12 registry
-rw-r--r-- 1 root root 0 May 18 08:12 suspend
-rw-r--r-- 1 root root 0 May 18 08:12 suspend_depth
-r--r--r-- 1 root root 0 May 18 08:12 version
dr-xr-xr-x 2 root root 0 May 18 08:12 warnings
These folders do exist after the upgrade.
I am interested to know this. Every time the NVIDIA driver upgrades, my users complain that Docker does not work. Their workaround was to disable NVIDIA updates, which is the worst possible solution. Did you figure out how to make Docker run after an NVIDIA driver upgrade without having to reboot the entire system?
When you say that "docker does not work", does this mean that new containers cannot be started, or that existing containers stop working? When upgrading the driver, the libraries that are mounted into the running containers are removed and replaced by updated versions. With this in mind, keeping Docker containers that use the driver running through an upgrade is not a supported use case, nor is stopping and then restarting them, since they may still reference the old libraries.
Restarting the docker containers (once terminated) should pick up the new libraries and binaries and mount these into the container instead.
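For reference, a quick way to compare the driver version seen inside a long-running GPU container with the one on the host (a sketch; `<container>` is a placeholder) is:

```sh
# Driver version as seen from inside the container vs. on the host
docker exec <container> nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```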
Just thought this might help others as well.
The behaviour we reported in the main issue (the need for a Docker daemon restart after an NVIDIA driver upgrade) is not happening when upgrading from 470 to 515, on systems identical to the ones in our initial report.
For us, the following worked as expected:
1. Stop all containers using nvidia drivers
2. Unload Nvidia Kernel modules
# Unload Nvidia kernel modules
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia
3. Install new Nvidia drivers
4. Start the previously stopped containers (this time the startup does not throw errors)
5. Confirm the new driver by running nvidia-smi inside the containers (e.g. via docker exec); a consolidated sketch of the whole procedure is shown below.
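For convenience, here is the same procedure as a single shell sketch (container names and the installer path are placeholders; adapt them to your environment):

```sh
#!/bin/sh
# Consolidated sketch of the workaround above.
set -e

GPU_CONTAINERS="nvidia-test"        # placeholder: list your GPU containers here

# 1. Stop all containers using the NVIDIA driver
docker stop $GPU_CONTAINERS

# 2. Unload the NVIDIA kernel modules (dependent modules first)
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia

# 3. Install the new driver (installer path is a placeholder)
./Driver.run

# 4. Start the previously stopped containers
docker start $GPU_CONTAINERS

# 5. Confirm the containers see the new driver
for c in $GPU_CONTAINERS; do
    docker exec "$c" nvidia-smi --query-gpu=driver_version --format=csv,noheader
done
```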
> When you say that "docker does not work", does this mean that new containers cannot be started, or that existing containers stop working? When upgrading the driver, the libraries that are mounted into the running containers are removed and replaced by updated versions. With this in mind, keeping Docker containers that use the driver running through an upgrade is not a supported use case, nor is stopping and then restarting them, since they may still reference the old libraries.
> Restarting the docker containers (once terminated) should pick up the new libraries and binaries and mount these into the container instead.
@elezar I don't know for sure, unfortunately, since this is being handled by my co-workers without telling me what they are doing, but I believe the existing containers stop working after an upgrade. They used to fix this by rebooting the server, which is something I would like to avoid. But you are saying that just restarting the container may fix the issue, so I will have to monitor whether this happens again and check.
@remoteweb thanks, but this procedure is not good for me. I update NVIDIA drivers with regular system updates, and I am not going to check first whether there are driver updates just to run this whole process... I just want to keep the system updated, and updates should not break running things. Maybe Docker could have a rule somewhere that does something like this automatically when drivers are being upgraded.
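There is no official mechanism for this, but as a rough illustration of that kind of automation, a small check could be run after system updates (from cron or a package-manager hook, depending on the distro), along these lines:

```sh
#!/bin/sh
# Hypothetical post-update check: if nvidia-smi fails on the host (the usual
# symptom of a driver/library mismatch after an in-place update), reload the
# kernel modules and restart the GPU containers.
GPU_CONTAINERS="nvidia-test"   # placeholder: list your GPU containers here

if ! nvidia-smi > /dev/null 2>&1; then
    docker stop $GPU_CONTAINERS
    modprobe -r nvidia_drm nvidia_uvm nvidia_modeset nvidia
    modprobe nvidia
    docker start $GPU_CONTAINERS
fi
```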
@leoheck if uptime is required, you need to redesign how your architecture works. For example, you could deploy a new container against the new driver release and kill the old one once the new one is up. Containers should normally be stateless.
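A minimal illustration of that approach (myapp, myapp-v1 and myapp-v2 are placeholder names; a real setup would drive this from an orchestrator with health checks):

```sh
# Bring up a replacement container that picks up the new driver mounts,
# then retire the old one once the new one is confirmed healthy.
docker run -d --name myapp-v2 --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all myapp:latest
# ...verify readiness of myapp-v2 here (health check, smoke test)...
docker stop myapp-v1 && docker rm myapp-v1
```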
This makes sense; this is what I came here to understand. Unfortunately, I have not seen the issue myself yet!