gpu: Support GPU passthrough to LXD containers using Container Device Interface (CDI)
Abstract
Don't rely exclusively on nvidia-container-cli in LXC to configure GPU passthrough. As that tool is being deprecated and won't work with the NVIDIA Tegra iGPU, we use the tooling around nvidia-container-toolkit to generate a source of truth instructing LXD on what to pass (card devices, non-card devices and runtime libraries) into the container namespace.
Rationale
The reason behind this change is the need to support NVIDIA iGPU passthrough for LXD containers (e.g., Tegra SoCs like the AGX/IGX Orin boards). The current implementation of NVIDIA GPU passthrough for LXD containers is based on the nvidia-container-cli tool, which is not compatible with iGPU passthrough: nvidia-container-cli and other tools like nvidia-smi rely on NVML (NVIDIA Management Library), which reads PCI / PCI-X information to work properly. That information does not exist for an iGPU living on an AMBA bus rather than a PCI / PCI-X bus, so NVML cannot be used there.
Specification
A novel approach could consist of leveraging the recent effort from NVIDIA to provide a more generic and flexible tool called nvidia-container-toolkit. This project focuses on:
- Generating, validating and managing a platform configuration file to be integrated with an OCI runtime. This file describes which device nodes need to be mounted inside the container to provide the necessary access to the GPU, alongside the symlinks to the NVIDIA drivers and libraries. The file (in JSON format) is meant to be merged with the OCI image definition (also in JSON format) to provide a complete description of the container runtime environment. The standard uses for this CDI definition are described in the CDI specification; a trimmed sketch is shown after this list.
- Providing a runtime shim to allow GPU device passthrough in OCI-based container manager solutions.
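For illustration, here is a trimmed sketch of what such a generated CDI spec can look like. All concrete values (device numbers, driver version, library path) are hypothetical examples; on a real system, `nvidia-ctk cdi generate` produces the authoritative file:

```yaml
# Trimmed, illustrative CDI spec (hypothetical values).
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          type: c
          major: 195
          minor: 0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
  mounts:
    - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.550.54.14
      containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.550.54.14
      options: ["ro", "nosuid", "nodev", "bind"]
  hooks:
    - hookName: createContainer
      path: /usr/bin/nvidia-ctk
      args: ["nvidia-ctk", "hook", "update-ldcache"]
```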
One might wonder how this would be useful, as LXD does not follow the OCI specification. The idea is to reuse the generation, validation and management logic for the platform-specific configuration file (which follows the CDI specification), but to adapt the mounting logic so that it can be used in an LXC context.
Instead of merging this CDI representation with the OCI image definition (which we don't have anyway), we'll read the device node entries and map them to lxc.mount.entry elements. We might also have to add the associated lxc.cgroup*.devices.allow entries to allow the container to access the GPU device nodes.
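As a rough sketch of that translation, and reusing the hypothetical /dev/nvidia0 device node from the spec excerpt above (major/minor numbers are illustrative), the resulting raw LXC configuration could look like:

```
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.cgroup2.devices.allow = c 195:0 rwm
```

(`lxc.cgroup2.devices.allow` applies on hosts using the unified cgroup hierarchy; legacy hosts would use `lxc.cgroup.devices.allow` instead.)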
Implementation details
In the id field (gputype=physical and gputype=mig): possibility to identify the device by a CDI identifier
The id GPU config option is meant to receive the DRM card ID of the parent GPU device. We'll augment this option to also accept the CDI identifier of the GPU device. In the same fashion as OCI device passing, this means that id could be, for example, {VENDOR_DOMAIN_NAME}/gpu=gpu{INDEX} for each (non-MIG-enabled, if the vendor is NVIDIA) full GPU in the system. For MIG-enabled GPUs, the id would be {VENDOR_DOMAIN_NAME}/gpu=mig{GPU_INDEX}:{MIG_INDEX}. We'll also have {VENDOR_DOMAIN_NAME}/gpu=all, which will potentially create multiple LXD GPU devices in the container, one for each GPU in the system.
Having a CDI identifier like this gives us a way to target an iGPU that does not live on a PCIe bus (where no pci config option is possible).
For the iGPU, we'll not introduce a new gputype but will use physical, except that the id will have to be a CDI identifier (because there is no PCI address to map to the device).
This approach is ideal for the end user because we neither introduce a new gputype for the iGPU nor change the config options for the other gputypes. The id option simply becomes more flexible and can accept either a DRM card ID or a CDI identifier (in the latter case, it overrides vendorid, productid, pci and even mig.* if the card is MIG-enabled). The rest of the machinery is hidden from the user.
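To make the resulting user experience concrete, here is how the extended id option would be used (instance and device names are arbitrary examples; the identifier forms follow the scheme described above):

```sh
# Full (non-MIG) GPU, addressed by a CDI identifier instead of a DRM card ID:
lxc config device add c1 gpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0

# A MIG partition on the first GPU:
lxc config device add c1 mig0 gpu gputype=mig id=nvidia.com/gpu=mig0:0

# All GPUs in the system (may create multiple LXD GPU devices in the container):
lxc config device add c1 allgpus gpu gputype=physical id=nvidia.com/gpu=all
```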
With this change in mind, we built the following development roadmap:
- Augmenting `gputype=physical`: the `id` config option needs to be augmented to support a CDI identifier. We need more validation rules. Handle the `all` case.
- Augmenting `gputype=mig`: the `id` config option needs to be augmented to support a CDI identifier. Same as above. Handle the `all` case.
- CDI spec generation: we need to generate the CDI spec each time we start the GPU device (in the `startContainer` function).
- CDI spec translation: if a CDI spec has been successfully generated and validated against the GPU device the user has requested, we need to translate this spec into actionable LXC mount entries plus cgroup permissions. The hooks need to be adapted to our format.
- LXC driver: detect whether a GPU device has been started with CDI. If so, redirect the LXC hook from `hooks/nvidia` to `cdi-hook`.
- Creation of the `cdi-hook` binary in the LXD project. This binary will be responsible for executing all the hooks (pivoting to the container rootfs, then updating the ldcache, then creating the symlinks); a rough sketch of these steps is shown after this list. Its source code should live at the top level of the project structure (like other LXD tools). That way, we (Canonical) really own this hook and do not force the Linux Containers project to maintain it as part of their `hooks` folder. When CDI is used, we simply execute `cdi-hook` instead of LXC's `hooks/nvidia` hook.
- Adapt snap packaging to include the `cdi-hook` binary.
- Real-hardware testing: we can run a CUDA workload inside a container to check that the devices and the runtime libraries have been passed through.
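For reference, the steps performed by the `cdi-hook` binary roughly amount to the following shell operations. This is only a sketch: the real binary is written in Go, and the rootfs path, driver version and library path below are hypothetical examples:

```sh
# Pivot into the container rootfs, then apply the CDI hooks there.
ROOTFS=/var/snap/lxd/common/lxd/containers/t1/rootfs   # hypothetical path

# 1. Update the ld cache so the bind-mounted NVIDIA libraries are found.
chroot "$ROOTFS" ldconfig

# 2. Create the symlinks requested by the CDI spec (example: libcuda.so.1).
chroot "$ROOTFS" ln -sf libcuda.so.550.54.14 /usr/lib/x86_64-linux-gnu/libcuda.so.1
```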
API changes
No API changes expected.
CLI changes
No CLI changes expected.
Database changes
No database changes expected.
TODO:
- [x] PoC tool
- [x] snap package + integration test with dGPU + iGPU
Heads up @mionaalex - the "Documentation" label was applied to this issue.
@gabrielmougard as discussed in the 1:1, let's try to use the existing unix, disk and raw.lxc settings to confirm the theory that this allows iGPU cards to be used in the container, and if so then we can move on to discussing what the user experience and implementation will be. Thanks
/cc @elezar
/cc @zvonkok
@tomponline this current implementation works for the dGPU / iGPU passthrough (working on the documentation) with Docker nested inside a LXD container (Docker inside requires security.nesting=true and security.privileged=true). We can use this cloud-init script (cloud-init.yaml) to test it:
```yaml
#cloud-config
package_update: true

packages:
  - docker.io

write_files:
  - path: /etc/docker/daemon.json
    permissions: '0644'
    owner: root:root
    content: |
      {
        "max-concurrent-downloads": 12,
        "max-concurrent-uploads": 12,
        "runtimes": {
          "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
          }
        }
      }
  - path: /root/run_tensorrt.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      echo "OS release,Kernel version"
      (. /etc/os-release; echo "${PRETTY_NAME}"; uname -r) | paste -s -d,
      echo
      nvidia-smi -q
      echo
      exec bash -o pipefail -c "
      cd /workspace/tensorrt/samples
      make -j4
      cd /workspace/tensorrt/bin
      ./sample_onnx_mnist
      retstatus=\${PIPESTATUS[0]}
      echo \"Test exited with status code: \${retstatus}\" >&2
      exit \${retstatus}
      "

runcmd:
  - systemctl start docker
  - systemctl enable docker
  - usermod -aG docker root
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure
  - systemctl restart docker
```
Then create the instance with the GPU:
```sh
lxc init ubuntu:jammy t1 --config security.nesting=true --config security.privileged=true
lxc config set t1 cloud-init.user-data - < cloud-init.yaml
# If the machine has a dGPU
lxc config device add t1 dgpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0
# Or, if the machine has an iGPU
lxc config device add t1 igpu0 gpu gputype=physical id=nvidia.com/gpu=igpu0
lxc start t1
lxc shell t1
root@t1 # docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3 bash /sh_input/run_tensorrt.sh
```
If you passed an iGPU, when you enter the container, edit /etc/nvidia-container-runtime/config.toml to set mode = "csv" instead of mode = "auto".
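One way to do this (a hypothetical one-liner, assuming the stock config contains mode = "auto"):

```sh
sed -i 's/mode = "auto"/mode = "csv"/' /etc/nvidia-container-runtime/config.toml
```

Then execute the docker command: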
```sh
docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3-igpu bash /sh_input/run_tensorrt.sh
```
@elezar thanks for this detailed review! Taking a look.
Please rebase
@gabrielmougard Can you let me know when you want me to check the docs again? There's still some open suggestions.
I'm still running some tests with the PR for now, but I'll probably update the doc with the changes you suggested right after (probably today, by the end of the afternoon).
@tomponline updated (except for the doc howto which I'm still working on)
Static analysis issues
thanks for your help with this @elezar !
Thanks a lot @elezar !
ready for rebase
Please can you rebase
@gabrielmougard can you also check that first commit as there is a persistent failure downloading the go deps on the tests, might need to refresh that commit so it reflects the current state in main
@tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995)
That seems related to the dqlite PPA; I've flagged it to the dqlite team, but it's not related to your PR. Thanks for flagging though.