
Upgrade NVIDIA driver without the need to restart the Docker daemon

Open · remoteweb opened this issue 3 years ago · 8 comments

1. Issue or feature description

We need to upgrade the NVIDIA driver on the host and have containers pick it up without restarting the Docker daemon.

This applies to containers consuming GPU capabilities through the NVIDIA Docker runtime.

After the host NVIDIA driver is updated while all GPU-consuming containers are stopped, trying to start the containers again fails with the following error message:

stderr: nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/307038625cd555791f9de4ea47596a7a5815ca21a7e9b6a783368637a2fb24cd/merged/proc/driver/nvidia/params/version/registry: no such file or directory: unknown"

If we restart the whole Docker daemon, the container comes back online properly.

2. Steps to reproduce the issue

First, make sure you are running an older driver version and have at least one Docker container consuming the GPU (e.g., running nvidia-smi from within a container).

Requirements:
• Docker installed
• NVIDIA Docker installed
• NVIDIA driver installed
• Working nvidia-smi command

docker run --name=nvidia-test --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nginx nvidia-smi

nvidia-smi should print GPU information, marking the test as passed.
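
To make the before/after comparison easier, a quick sanity check (a sketch; the --query-gpu and --format flags are standard nvidia-smi options) is to compare the driver version reported on the host and inside a container:

# driver version on the host
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# driver version as seen from inside a GPU container
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nginx \
    nvidia-smi --query-gpu=driver_version --format=csv,noheader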

Now let's reproduce the behavior in question:

# stop container
docker stop nvidia-test

# unload kernel module
modprobe -r nvidia_drm

# download and install any newer Linux driver
./Driver.run

# then restart container
docker start nvidia-test 

You should get an error message similar to:

stderr: nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/307038625cd555791f9de4ea47596a7a5815ca21a7e9b6a783368637a2fb24cd/merged/proc/driver/nvidia/params/version/registry: no such file or directory: unknown"


If you restart the Docker daemon:

systemctl restart docker

then the container can be brought back online:

docker start nvidia-test

Information to attach (optional if deemed irrelevant)

  • [ ] Some nvidia-container information:
NVRM version:   510.68.02
CUDA version:   11.6

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce GTX 1650
Brand:          GeForce
GPU UUID:       GPU-7878ba12-9b30-8f49-3da8-7930824af120
Bus Location:   00000000:82:00.0
  • [ ] Kernel Version
Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • [ ] Driver information from nvidia-smi -a
$nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Wed May  4 10:57:14 2022
Driver Version                            : 510.68.02
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:82:00.0
    Product Name                          : NVIDIA GeForce GTX 1650
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-7878ba12-9b30-8f49-3da8-7930824af120
    Minor Number                          : 0
    VBIOS Version                         : 90.17.3D.00.4E
    MultiGPU Board                        : No
    Board ID                              : 0x8200
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x82
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1F8210DE
        Bus Id                            : 00000000:82:00.0
        Sub System Id                     : 0x8D921462
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 40 %
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 4096 MiB
        Reserved                          : 184 MiB
        Used                              : 0 MiB
        Free                              : 3911 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 49 C
        GPU Shutdown Temp                 : 97 C
        GPU Slowdown Temp                 : 94 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 18.87 W
        Power Limit                       : 75.00 W
        Default Power Limit               : 75.00 W
        Enforced Power Limit              : 75.00 W
        Min Power Limit                   : 45.00 W
        Max Power Limit                   : 75.00 W
    Clocks
        Graphics                          : 1485 MHz
        SM                                : 1485 MHz
        Memory                            : 4001 MHz
        Video                             : 1380 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2130 MHz
        SM                                : 2130 MHz
        Memory                            : 4001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

  • [ ] Docker version from docker version
docker version
Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:48 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:16:13 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • [ ] NVIDIA container library version from nvidia-container-cli -V
nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

remoteweb commented on May 04 '22 14:05

@remoteweb when a create command is intercepted, the NVIDIA Container Library performs mount operations in the container's namespace. These include tmpfs mounts over the following three files:

/proc/driver/nvidia/params
/proc/driver/nvidia/version
/proc/driver/nvidia/registry

The error you are seeing seems to indicate that the /proc/driver/nvidia folder does not exist on the host. Can you confirm that it does exist?
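
For example, a quick check on the host (a minimal sketch):

# confirm the driver proc entries exist on the host
ls -l /proc/driver/nvidia
cat /proc/driver/nvidia/version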

Note that there is a fix released in NVIDIA Container Toolkit 1.6.0 that addresses the wording of the mount error you are seeing.

elezar commented on May 10 '22 10:05

To be more specific: we updated from driver version 440 to 510.

With the 440 driver, /proc/driver/nvidia looks like this:

~ $ls -al /proc/driver/nvidia
total 0
dr-xr-xr-x. 5 root root 0 Oct  2  2021 .
dr-xr-xr-x. 6 root root 0 Oct  2  2021 ..
dr-xr-xr-x. 3 root root 0 Oct  2  2021 gpus
-r--r--r--. 1 root root 0 May 18 08:13 params
dr-xr-xr-x. 2 root root 0 May 18 08:13 patches
-rw-r--r--. 1 root root 0 May 18 08:13 registry
-rw-r--r--. 1 root root 0 May 18 08:13 suspend
-rw-r--r--. 1 root root 0 May 18 08:13 suspend_depth
-r--r--r--. 1 root root 0 May 18 08:13 version
dr-xr-xr-x. 2 root root 0 May 18 08:13 warnings

and with 510 it looks like this:

~ #ls -la /proc/driver/nvidia
total 0
dr-xr-xr-x 6 root root 0 Dec 21 22:46 .
dr-xr-xr-x 7 root root 0 Dec 21 22:46 ..
dr-xr-xr-x 4 root root 0 May 18 08:12 capabilities
dr-xr-xr-x 3 root root 0 Dec 21 22:46 gpus
-r--r--r-- 1 root root 0 May 18 08:12 params
dr-xr-xr-x 2 root root 0 May 18 08:12 patches
-rw-r--r-- 1 root root 0 May 18 08:12 registry
-rw-r--r-- 1 root root 0 May 18 08:12 suspend
-rw-r--r-- 1 root root 0 May 18 08:12 suspend_depth
-r--r--r-- 1 root root 0 May 18 08:12 version
dr-xr-xr-x 2 root root 0 May 18 08:12 warnings

These folders do exist after the upgrade.

remoteweb commented on May 18 '22 12:05

I am interested in knowing this too. Every time the NVIDIA driver is upgraded, my users complain that Docker does not work. Their bad solution was to disable NVIDIA updates, which is the worst solution. Did you figure out how to make Docker run after an NVIDIA driver upgrade without having to reboot the entire system?

leoheck commented on Jun 20 '22 18:06

> I am interested in knowing this too. Every time the NVIDIA driver is upgraded, my users complain that Docker does not work. Their bad solution was to disable NVIDIA updates, which is the worst solution. Did you figure out how to make Docker run after an NVIDIA driver upgrade without having to reboot the entire system?

When you say that "docker does not work", does this mean that new containers cannot be started, or that existing containers stop working? When the driver is upgraded, the libraries that are mounted into running containers are removed and replaced by updated versions. With this in mind, keeping Docker containers that use the driver running through an upgrade is not a supported use case, nor is stopping and then restarting them, since they may still reference the old libraries.

Recreating the Docker containers (once the old ones are terminated) should pick up the new libraries and binaries and mount these into the new containers instead.
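
In other words, something like this (a sketch; nvidia-test is the container from the reproduction steps above):

# not supported across a driver upgrade: restarting the same container
docker start nvidia-test

# supported: terminate and recreate, so the new driver libraries are mounted in
docker rm -f nvidia-test
docker run --name=nvidia-test --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nginx nvidia-smi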

elezar commented on Jun 21 '22 08:06

Just thought this might help others as well.

The behaviour we reported in the main issue (the need for a Docker daemon restart after an NVIDIA driver upgrade) does not happen when upgrading from 470 to 515, on systems identical to those in our initial report.

For us, the following worked as expected.

1. Stop all containers using nvidia drivers

2. Unload Nvidia Kernel modules

# Unload Nvidia kernel modules
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia

3. Install new Nvidia drivers

4. Start the previously stopped containers (this time the startup does not throw errors)

5. Confirm the new driver by running nvidia-smi (via exec) within the containers. A consolidated sketch of the whole procedure follows below.
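
Put together (a sketch, assuming a single container named nvidia-test and a driver installer named Driver.run, as in the reproduction steps; adjust to your setup):

# 1. stop all containers using the NVIDIA driver
docker stop nvidia-test

# 2. unload the NVIDIA kernel modules (dependent modules first)
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia

# 3. install the new driver
./Driver.run

# 4. start the previously stopped containers
docker start nvidia-test

# 5. confirm the new driver version from inside the container
docker exec nvidia-test nvidia-smi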

remoteweb commented on Jun 21 '22 09:06

> When you say that "docker does not work", does this mean that new containers cannot be started, or that existing containers stop working? When the driver is upgraded, the libraries that are mounted into running containers are removed and replaced by updated versions. With this in mind, keeping Docker containers that use the driver running through an upgrade is not a supported use case, nor is stopping and then restarting them, since they may still reference the old libraries.
>
> Recreating the Docker containers (once the old ones are terminated) should pick up the new libraries and binaries and mount these into the new containers instead.

@elezar I don't know for sure, unfortunately, since this is being handled by my co-workers without telling me what they are doing, but I believe the existing containers no longer work after an upgrade. They used to fix this by rebooting the server, which is something I would like to avoid. But you are saying that recreating the container may fix the issue, so I will have to monitor whether this happens again to check.

@remoteweb thanks, but this procedure does not work for me. I update NVIDIA drivers through regular system updates; I am not going to check first whether there are drivers to be updated just to go through this whole process... I just want to keep the system updated, and updates should not break running things. Maybe Docker could have a rule somewhere that does something like this automatically when drivers are being upgraded.

leoheck commented on Jun 21 '22 18:06

@leoheck if uptime is required, you need to redesign your architecture. For example, you could deploy a new container with the new driver release and kill the old one once the new one is up (see the sketch below). Containers should normally be stateless.
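
Something along these lines (a sketch; gpu-app-old, gpu-app-new, and my-gpu-image are placeholder names):

# after the driver upgrade, bring up a replacement container
docker run -d --name=gpu-app-new --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all my-gpu-image

# once the new container is confirmed healthy, retire the old one
docker exec gpu-app-new nvidia-smi && docker rm -f gpu-app-old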

remoteweb commented on Jun 21 '22 19:06

This makes sense; it is what I came here to understand. Unfortunately, I have not seen the issue myself yet!

leoheck commented on Jun 21 '22 21:06