GPUs exposed with CDI not visible in build

Open ghost opened this issue 10 months ago • 3 comments

I would like to use GPUs in container builds in my CI pipeline with buildah. To do so, I use the NVIDIA Container Toolkit (nvidia-ctk) to generate the CDI YAML file. The file looks fine and works with podman run:

podman run --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

However, trying to run the following build command

podman build --format docker --file docker/Dockerfile --device nvidia.com/gpu=all

with the following Dockerfile fails

FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04

USER root
RUN df -h
RUN nvidia-smi

with the following output:

time="2025-01-24T10:56:15Z" level=warning msg="Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning."
STEP 1/4: FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
STEP 2/4: USER root
--> a85047a34ada
STEP 3/4: RUN df -h
time="2025-01-24T10:56:16Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/usr/share/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
time="2025-01-24T10:56:16Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/etc/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
Filesystem      Size  Used Avail Use% Mounted on
fuse-overlayfs  1.8T  219G  1.6T  13% /
tmpfs            64M     0   64M   0% /dev
/dev/md127      1.8T  219G  1.6T  13% /dev/termination-log
shm              64M   84K   64M   1% /dev/shm
tmpfs           504G   12K  504G   1% /proc/driver/nvidia
overlay         1.8T  219G  1.6T  13% /proc/acpi
tmpfs           504G     0  504G   0% /sys/fs/cgroup
--> 6f29b1abb0b9
STEP 4/4: RUN nvidia-smi
time="2025-01-24T10:56:19Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/usr/share/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
time="2025-01-24T10:56:19Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/etc/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
/bin/sh: 1: nvidia-smi: not found
subprocess exited with status 127
subprocess exited with status 127
Error: building at STEP "RUN nvidia-smi": exit status 127

I'm using the following versions:

$ buildah -v
buildah version 1.38.0 (image-spec 1.1.0, runtime-spec 1.2.0)
$ podman -v
podman version 5.3.1

So in comparison to issues #5556 and #5813, this is more of a "silent" failure, since the build executor is created without complaining about the device.

I'm not sure how to debug this issue further. I've also tried buildah versions 1.36.0 and 1.37.5, both of which also support CDI device specification, with the same outcome.
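
One way to narrow this down might be to run the same small check in both environments and compare the results. The script below is only a diagnostic sketch, not a fix: the helper names count_nvidia_nodes and has_command are made up for illustration, and it assumes that the CDI spec is expected to expose device nodes named /dev/nvidia* and to put nvidia-smi on PATH.

```shell
#!/bin/sh
# Diagnostic sketch: report whether NVIDIA device nodes and nvidia-smi are
# visible in the current environment. Helper names are hypothetical.

# count_nvidia_nodes DIR: print the number of entries named nvidia* in DIR.
count_nvidia_nodes() {
    dir="$1"
    count=0
    for node in "$dir"/nvidia*; do
        # An unmatched glob stays literal, so check that the entry exists.
        [ -e "$node" ] && count=$((count + 1))
    done
    echo "$count"
}

# has_command NAME: print "yes" if NAME resolves on PATH, otherwise "no".
has_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

echo "nvidia device nodes in /dev: $(count_nvidia_nodes /dev)"
echo "nvidia-smi on PATH: $(has_command nvidia-smi)"
```

Saved under a hypothetical name such as check-gpu.sh, the script could be executed once in a container started with podman run and once from a RUN step in the Dockerfile (after a COPY of the script). If the device count is zero only in the build, that would suggest the CDI edits are not being applied to the build container at all, rather than a PATH problem.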

Any hints on how to fix this? Thanks a lot in advance!

PS:

Here's also the output from nvidia-ctk:

$ nvidia-ctk cdi list
time="2025-01-24T10:49:36Z" level=info msg="Found 17 CDI devices"
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=2
nvidia.com/gpu=3
nvidia.com/gpu=4
nvidia.com/gpu=5
nvidia.com/gpu=6
nvidia.com/gpu=7
nvidia.com/gpu=GPU-12efe259-604a-6c44-c58c-4178d4c35d3e
nvidia.com/gpu=GPU-133c740e-bad1-bde5-325d-4a49eec5dfae
nvidia.com/gpu=GPU-24db9a46-a825-5e6b-2950-07d51fb79aed
nvidia.com/gpu=GPU-4bd7a46e-8df1-3a65-5038-7b3a4baec73c
nvidia.com/gpu=GPU-8ed47b3f-a71e-71c7-18a6-bd37bf1cde8a
nvidia.com/gpu=GPU-c06d0cc5-b033-4cb4-977c-4907b6f50f5e
nvidia.com/gpu=GPU-d8afe118-eb71-e279-b65c-bc4d1640c63a
nvidia.com/gpu=GPU-dd122b28-6236-854d-b42f-6bd45143d55b
nvidia.com/gpu=all

ghost, Feb 24 '25 16:02

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot], Mar 27 '25 00:03

Not stale.

sanmai-NL, Apr 08 '25 07:04

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot], May 09 '25 00:05

Is there any workaround available for this?

rupansh, Jul 28 '25 16:07