GPUs exposed with CDI not visible in build
I would like to use GPUs in my container builds in my CI pipeline with buildah.
For that I'm using the NVIDIA Container Toolkit (nvidia-ctk) to generate the CDI YAML spec, which looks fine and works successfully with podman run:
podman run --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
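For reference, the CDI spec was generated roughly like this (the output path is just the default location I'm assuming here; the actual path on the CI runner may differ):
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml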
However, running the following build command
podman build --format docker --file docker/Dockerfile --device nvidia.com/gpu=all
against the following Dockerfile fails
FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
USER root
RUN df -h
RUN nvidia-smi
with the following output:
time="2025-01-24T10:56:15Z" level=warning msg="Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning."
STEP 1/4: FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
STEP 2/4: USER root
--> a85047a34ada
STEP 3/4: RUN df -h
time="2025-01-24T10:56:16Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/usr/share/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
time="2025-01-24T10:56:16Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/etc/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
Filesystem Size Used Avail Use% Mounted on
fuse-overlayfs 1.8T 219G 1.6T 13% /
tmpfs 64M 0 64M 0% /dev
/dev/md127 1.8T 219G 1.6T 13% /dev/termination-log
shm 64M 84K 64M 1% /dev/shm
tmpfs 504G 12K 504G 1% /proc/driver/nvidia
overlay 1.8T 219G 1.6T 13% /proc/acpi
tmpfs 504G 0 504G 0% /sys/fs/cgroup
--> 6f29b1abb0b9
STEP 4/4: RUN nvidia-smi
time="2025-01-24T10:56:19Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/usr/share/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
time="2025-01-24T10:56:19Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/etc/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
/bin/sh: 1: nvidia-smi: not found
subprocess exited with status 127
subprocess exited with status 127
Error: building at STEP "RUN nvidia-smi": exit status 127
I'm using the following versions:
$ buildah -v
buildah version 1.38.0 (image-spec 1.1.0, runtime-spec 1.2.0)
$ podman -v
podman version 5.3.1
So compared to issues #5556 and #5813, this is more of a "silent" failure, since the build executor is created without complaining about the device.
I'm not sure how to debug this further. I've also tried buildah versions 1.36.0 and 1.37.5, which both support CDI device specifications, with the same outcome.
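One thing I could still try (just a sketch; the library path is a guess for the Ubuntu 22.04 base image) is adding a couple of RUN steps to the Dockerfile to check whether the device nodes and driver libraries get injected into the build container at all:
RUN ls -la /dev/nvidia* || true
RUN ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* || true
and running the build with debug logging (podman --log-level=debug build ...) to see whether the CDI device is resolved when the RUN container is created.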
Any hints on how to fix this? Thanks a lot in advance!
PS:
Here's also the output from nvidia-ctk:
$ nvidia-ctk cdi list
time="2025-01-24T10:49:36Z" level=info msg="Found 17 CDI devices"
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=2
nvidia.com/gpu=3
nvidia.com/gpu=4
nvidia.com/gpu=5
nvidia.com/gpu=6
nvidia.com/gpu=7
nvidia.com/gpu=GPU-12efe259-604a-6c44-c58c-4178d4c35d3e
nvidia.com/gpu=GPU-133c740e-bad1-bde5-325d-4a49eec5dfae
nvidia.com/gpu=GPU-24db9a46-a825-5e6b-2950-07d51fb79aed
nvidia.com/gpu=GPU-4bd7a46e-8df1-3a65-5038-7b3a4baec73c
nvidia.com/gpu=GPU-8ed47b3f-a71e-71c7-18a6-bd37bf1cde8a
nvidia.com/gpu=GPU-c06d0cc5-b033-4cb4-977c-4907b6f50f5e
nvidia.com/gpu=GPU-d8afe118-eb71-e279-b65c-bc4d1640c63a
nvidia.com/gpu=GPU-dd122b28-6236-854d-b42f-6bd45143d55b
nvidia.com/gpu=all
A friendly reminder that this issue had no activity for 30 days.
Not stale.
A friendly reminder that this issue had no activity for 30 days.
Is there any workaround available for this?