gpu: Support GPU passthrough to LXD containers using Container Device Interface (CDI)
Abstract
Don't rely exclusively on nvidia-container-cli in LXC to configure GPU passthrough. As that tool is being deprecated and won't work with the NVIDIA Tegra iGPU, we use the tooling around nvidia-container-toolkit to generate a source of truth instructing LXD on what to pass (card devices, non-card devices and runtime libraries) into the container namespace.
Rationale
The reason behind this change is the need to support NVIDIA iGPU passthrough for LXD containers (e.g., Tegra SoCs like the AGX/IGX Orin boards). The current implementation of NVIDIA GPU passthrough for LXD containers is based on the nvidia-container-cli tool, which is not compatible with iGPU passthrough: nvidia-container-cli and other tools like nvidia-smi rely on NVML (NVIDIA Management Library), which reads PCI / PCI-X information to work properly. That information does not exist for an iGPU living on an AMBA bus rather than a PCI / PCI-X bus, so NVML cannot be used there.
Specification
A novel approach could consist of leveraging the recent effort from NVIDIA to provide a more generic and flexible tool called nvidia-container-toolkit. This project focuses on:
- Generating, validating and managing a platform configuration file to be integrated with an OCI runtime. This file describes which device nodes need to be mounted inside the container to provide the necessary access to the GPU, alongside the symlinks to the NVIDIA drivers and libraries. The file (in JSON format) is meant to be merged with the OCI image definition (also in JSON format) to provide a complete description of the container runtime environment. The standard uses for this CDI definition are described in the CDI specification; a trimmed sketch is shown after this list.
- Providing a runtime shim to allow GPU device passthrough in OCI-based container manager solutions.
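For illustration, here is a trimmed sketch of what such a generated CDI spec can look like. All concrete values (device numbers, driver version, library path) are hypothetical examples; on a real system, `nvidia-ctk cdi generate` produces the authoritative file:

```yaml
# Trimmed, illustrative CDI spec (hypothetical values).
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          type: c
          major: 195
          minor: 0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
  mounts:
    - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.550.54.14
      containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.550.54.14
      options: ["ro", "nosuid", "nodev", "bind"]
  hooks:
    - hookName: createContainer
      path: /usr/bin/nvidia-ctk
      args: ["nvidia-ctk", "hook", "update-ldcache"]
```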
One might wonder how this would be useful, as LXD does not follow the OCI specification. The idea is to reuse the generation, validation and management logic for the platform-specific configuration file (which follows the CDI specification), but to adapt the mounting logic so that it can be used in an LXC context.
Instead of merging this CDI representation with the OCI image definition (which we don't have anyway), we'll read the device node entries and map them to lxc.mount.entry elements. We might also have to add the associated lxc.cgroup*.devices.allow entries to allow the container to access the GPU device nodes.
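As a rough sketch of that translation, and reusing the hypothetical /dev/nvidia0 device node from the spec excerpt above (major/minor numbers are illustrative), the resulting raw LXC configuration could look like:

```
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.cgroup2.devices.allow = c 195:0 rwm
```

(`lxc.cgroup2.devices.allow` applies on hosts using the unified cgroup hierarchy; legacy hosts would use `lxc.cgroup.devices.allow` instead.)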
Implementation details
In the id field (gputype=physical and gputype=mig): possibility to identify the device by a CDI identifier
The id GPU config option is meant to receive the DRM card ID of the parent GPU device. We'll augment this option to also accept the CDI identifier of the GPU device. In the same fashion as OCI device passing, this means that id could be, for example, {VENDOR_DOMAIN_NAME}/gpu=gpu{INDEX} for each (non-MIG-enabled, if the vendor is NVIDIA) full GPU in the system. For MIG-enabled GPUs, the id would be {VENDOR_DOMAIN_NAME}/gpu=mig{GPU_INDEX}:{MIG_INDEX}. We'll also have {VENDOR_DOMAIN_NAME}/gpu=all, which will potentially create multiple LXD GPU devices in the container, one for each GPU in the system.
Having a CDI identifier like this gives us a way to target an iGPU that does not live on a PCIe bus (where no pci config option is possible).
For the iGPU, we'll not introduce a new gputype but will use physical, except that the id will have to be a CDI identifier (because there is no PCI address to map to the device).
This approach is ideal for the end user because we neither introduce a new gputype for the iGPU nor change the config options for the other gputypes. The id option simply becomes more flexible and can accept either a DRM card ID or a CDI identifier (in the latter case, it overrides vendorid, productid, pci and even mig.* if the card is MIG-enabled). The rest of the machinery is hidden from the user.
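To make the resulting user experience concrete, here is how the extended id option would be used (instance and device names are arbitrary examples; the identifier forms follow the scheme described above):

```sh
# Full (non-MIG) GPU, addressed by a CDI identifier instead of a DRM card ID:
lxc config device add c1 gpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0

# A MIG partition on the first GPU:
lxc config device add c1 mig0 gpu gputype=mig id=nvidia.com/gpu=mig0:0

# All GPUs in the system (may create multiple LXD GPU devices in the container):
lxc config device add c1 allgpus gpu gputype=physical id=nvidia.com/gpu=all
```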
With this change in mind, we built the following development roadmap:
- Augmenting `gputype=physical`: the `id` config option needs to be augmented to support a CDI identifier. We need more validation rules. Handle the `all` case.
- Augmenting `gputype=mig`: the `id` config option needs to be augmented to support a CDI identifier. Same as above. Handle the `all` case.
- CDI spec generation: we need to generate the CDI spec each time we start the GPU device (in the `startContainer` function).
- CDI spec translation: if a CDI spec has been successfully generated and validated against the GPU device the user has requested, we need to translate this spec into actionable LXC mount entries plus cgroup permissions. The hooks need to be adapted to our format.
- LXC driver: detect whether a GPU device has been started with CDI. If so, redirect the LXC hook from `hooks/nvidia` to `cdi-hook`.
- Creation of the `cdi-hook` binary in the LXD project. This binary will be responsible for executing all the hooks (pivoting to the container rootfs, then updating the ldcache, then creating the symlinks); a rough sketch of these steps is shown after this list. Its source code should live at the top level of the project structure (like other LXD tools). That way, we (Canonical) really own this hook and do not force the Linux Containers project to maintain it as part of their `hooks` folder. When CDI is used, we simply execute `cdi-hook` instead of LXC's `hooks/nvidia` hook.
- Adapt snap packaging to include the `cdi-hook` binary.
- Real-hardware testing: we can run a CUDA workload inside a container to check that the devices and the runtime libraries have been passed through.
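For reference, the steps performed by the `cdi-hook` binary roughly amount to the following shell operations. This is only a sketch: the real binary is written in Go, and the rootfs path, driver version and library path below are hypothetical examples:

```sh
# Pivot into the container rootfs, then apply the CDI hooks there.
ROOTFS=/var/snap/lxd/common/lxd/containers/t1/rootfs   # hypothetical path

# 1. Update the ld cache so the bind-mounted NVIDIA libraries are found.
chroot "$ROOTFS" ldconfig

# 2. Create the symlinks requested by the CDI spec (example: libcuda.so.1).
chroot "$ROOTFS" ln -sf libcuda.so.550.54.14 /usr/lib/x86_64-linux-gnu/libcuda.so.1
```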
API changes
No API changes expected.
CLI changes
No CLI changes expected.
Database changes
No database changes expected.
TODO:
- [x] PoC tool
- [x] snap package + integration test with dGPU + iGPU
Heads up @mionaalex - the "Documentation" label was applied to this issue.
@gabrielmougard as discussed in the 1:1, let's try to use the existing unix, disk and raw.lxc settings to confirm the theory that this allows iGPU cards to be used in the container, and if so then we can move on to discussing what the user experience and implementation will be. Thanks
/cc @elezar
/cc @zvonkok
@tomponline this current implementation works for the dGPU / iGPU passthrough (working on the documentation) with Docker nested inside a LXD container (Docker inside requires security.nesting=true and security.privileged=true). We can use this cloud-init script (cloud-init.yaml) to test it:
```yaml
#cloud-config
package_update: true

packages:
  - docker.io

write_files:
  - path: /etc/docker/daemon.json
    permissions: '0644'
    owner: root:root
    content: |
      {
        "max-concurrent-downloads": 12,
        "max-concurrent-uploads": 12,
        "runtimes": {
          "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
          }
        }
      }
  - path: /root/run_tensorrt.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      echo "OS release,Kernel version"
      (. /etc/os-release; echo "${PRETTY_NAME}"; uname -r) | paste -s -d,
      echo
      nvidia-smi -q
      echo
      exec bash -o pipefail -c "
      cd /workspace/tensorrt/samples
      make -j4
      cd /workspace/tensorrt/bin
      ./sample_onnx_mnist
      retstatus=\${PIPESTATUS[0]}
      echo \"Test exited with status code: \${retstatus}\" >&2
      exit \${retstatus}
      "

runcmd:
  - systemctl start docker
  - systemctl enable docker
  - usermod -aG docker root
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure
  - systemctl restart docker
```
Then create the instance with the GPU:
```sh
lxc init ubuntu:jammy t1 --config security.nesting=true --config security.privileged=true
lxc config set t1 cloud-init.user-data - < cloud-init.yaml
# If the machine has a dGPU
lxc config device add t1 dgpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0
# Or, if the machine has an iGPU
lxc config device add t1 igpu0 gpu gputype=physical id=nvidia.com/gpu=igpu0
lxc start t1
lxc shell t1
root@t1 # docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3 bash /sh_input/run_tensorrt.sh
```
If you passed an iGPU, when you enter the container, edit /etc/nvidia-container-runtime/config.toml to set mode = "csv" instead of mode = "auto".
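One way to do this (a hypothetical one-liner, assuming the stock config contains mode = "auto"):

```sh
sed -i 's/mode = "auto"/mode = "csv"/' /etc/nvidia-container-runtime/config.toml
```

Then execute the docker command: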
```sh
docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3-igpu bash /sh_input/run_tensorrt.sh
```
@elezar thanks for this detailed review! Taking a look.
Please rebase
@gabrielmougard Can you let me know when you want me to check the docs again? There's still some open suggestions.
I'm still running some tests with the PR for now, but I'll probably update the doc with the changes you suggested right after (probably today, by the end of the afternoon).
@tomponline updated (except for the doc howto which I'm still working on)
Static analysis issues
thanks for your help with this @elezar !
Thanks a lot @elezar !
ready for rebase
Please can you rebase
@gabrielmougard can you also check that first commit as there is a persistent failure downloading the go deps on the tests, might need to refresh that commit so it reflects the current state in main
@tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995)
That seems related to the dqlite PPA; I've flagged it to the dqlite team, but it's not related to your PR. Thanks for flagging though.