nvidia-container-toolkit broken and cgroups v2 issues
How did you upgrade to 21.10? (Fresh install / Upgrade)
Upgrade from 21.04 (it was actually quite accidental, in the sense that I was not aware 21.10 was still in beta :))
Related Application and/or Package Version (run apt policy $PACKAGE NAME):
nvidia-container-toolkit:
Installed: 1.5.1-1
Candidate: 1.5.1-1
Version table:
*** 1.5.1-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
100 /var/lib/dpkg/status
1.5.0-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.4.2-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.4.1-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.4.0-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.3.0-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.2.1-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.2.0-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.1.2-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.1.1-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.1.0-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.0.5-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.0.4-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.0.3-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
1.0.2-1 500
500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages
Issue/Bug Description:
Package nvidia-container-toolkit was missing. Previously it was provided by the System76 Pop PPA:
nvidia-container-toolkit:
Installed: 1.5.1-1pop1~1627998766~21.04~9847cf2
Candidate: 1.5.1-1pop1~1627998766~21.04~9847cf2
Version table:
*** 1.5.1-1pop1~1627998766~21.04~9847cf2 1001
1001 http://ppa.launchpad.net/system76/pop/ubuntu hirsute/main amd64 Packages
100 /var/lib/dpkg/status
I had to try to get it from an older Ubuntu release with:
distribution=ubuntu20.04   # Pop!_OS 21.10 has no matching entry upstream, so reuse the Ubuntu one
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
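For reference, a quick sanity check of what actually got installed from that repo (standard commands, nothing specific to my setup):
apt policy nvidia-container-toolkit   # confirm the candidate version and which source it came from
nvidia-container-cli --version        # version of the underlying libnvidia-container CLI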
but then I was getting:
$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled
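For what it's worth, a quick check (assuming the usual systemd mount layout) of which cgroup hierarchy the system is on:
stat -fc %T /sys/fs/cgroup/   # prints "cgroup2fs" on cgroups v2, "tmpfs" on the v1/hybrid layout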
This seems to be an issue with cgroups v2 (googling for the error leads to quite a few already-reported issues out there; I will try to compile a list later), and the workaround (not a solution) seemed to be:
sudo kernelstub -a "systemd.unified_cgroup_hierarchy=0"   # boot back into the cgroups v1 hierarchy
sudo update-initramfs -c -k all
sudo reboot
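After the reboot, the same checks should confirm the workaround took effect:
cat /proc/cmdline             # should now include systemd.unified_cgroup_hierarchy=0
stat -fc %T /sys/fs/cgroup/   # should report "tmpfs" instead of "cgroup2fs"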
Steps to reproduce (if you know):
- Get 21.10 PopOS
- Install nvidia-container-toolkit (and other nvidia stuff)
- Try to use
docker run --gpus all ...command
Expected behavior:
It works, with output along the lines of:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Thu Nov 11 10:21:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 39C P8 7W / 185W | 1486MiB / 7979MiB | 19% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Other Notes:
Happy to provide additional information. I planned to reinstall my machine back to 21.04, but decided to postpone for a day or two in case you'd like more information about the problem or have some advice.
May be related: https://github.com/NVIDIA/nvidia-docker/issues/1447
Note that only versions after v1.8.0 of the NVIDIA Container Toolkit (including libnvidia-container1) support cgroupv2. Please install a more recent version and see if this addresses your issue.
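For anyone landing here later, a sketch of what that upgrade might look like, assuming the libnvidia-container repository (NVIDIA's newer packaging location for these packages) carries the fixed versions for your distribution string; adjust as needed:
distribution=ubuntu20.04   # Pop!_OS is not listed upstream, so reuse the matching Ubuntu entry as above
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit   # should pull in libnvidia-container1 >= 1.8.0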