buildah
Use of CDI does not consume labeled devices during build
Issue Description
When using NVIDIA GPUs with Podman via the Container Device Interface (CDI), podman build fails to use labeled devices, while podman run works as intended.
However, when the direct device path is used instead, the podman build execution works as expected.
Steps to reproduce the issue
- Install NVIDIA Drivers
- Install Podman
- Install NVIDIA Container Toolkit:
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
- Configure NVIDIA CTK for use with CDI:
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
- Test CDI integration with podman run, which works:
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
- Start a podman build with the same device label, which fails:
# Get a test containerfile
curl -O https://raw.githubusercontent.com/kenmoini/smart-drone-patterns/main/apps/darknet/Containerfile.ubnt22
# Build a container with the device label which fails
podman build --device nvidia.com/gpu=all --security-opt=label=disable -t darknet -f Containerfile.ubnt22 .
# Output:
Error: creating build executor: getting info of source device nvidia.com/gpu=all: stat nvidia.com/gpu=all: no such file or directory
# Build a container with the direct device path which works
podman build --device /dev/nvidia0 -t darknet -f Containerfile.ubnt22 --security-opt=label=disable .
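For completeness, a quick sanity check (assuming the CDI spec was written to the default /etc/cdi directory as above) to confirm that the label handed to podman build is actually registered:
# Verify the generated CDI spec exists and lists the expected device names
ls -l /etc/cdi/nvidia.yaml
nvidia-ctk cdi list
# nvidia.com/gpu=all must appear in that list for the label used above to resolve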
Describe the results you received
The result of using the CDI device label fails:
podman build --device nvidia.com/gpu=all --security-opt=label=disable -t darknet -f Containerfile.ubnt22 .
Error: creating build executor: getting info of source device nvidia.com/gpu=all: stat nvidia.com/gpu=all: no such file or directory
Describe the results you expected
The container build should start when given the CDI device label. The build only works when the direct device path is used, but that does not appear to load all of the associated paths defined in the generated CDI configuration.
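For illustration (assuming the /etc/cdi/nvidia.yaml generated in the steps above is in place), the difference between the two --device forms is visible at run time: the CDI label injects the driver userspace files as well as the device nodes, while the raw device path only provides the node itself.
# With the CDI label, nvidia-smi and the driver libraries are mounted into a plain ubuntu image
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu ls /dev/nvidia* /usr/bin/nvidia-smi
# With only the device node, nvidia-smi is absent inside the container
podman run --rm --device /dev/nvidia0 --security-opt=label=disable ubuntu ls /dev/nvidia* /usr/bin/nvidia-smi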
podman info output
host:
arch: arm64
buildahVersion: 1.31.3
cgroupControllers:
- cpuset
- cpu
- io
- memory
- hugetlb
- pids
- rdma
- misc
cgroupManager: systemd
cgroupVersion: v2
conmon:
package: conmon-2.1.8-1.el9.aarch64
path: /usr/bin/conmon
version: 'conmon version 2.1.8, commit: f0f506932ce1dc9fc7f1adb457a73d0a00207272'
cpuUtilization:
idlePercent: 99.98
systemPercent: 0.01
userPercent: 0.01
cpus: 32
databaseBackend: boltdb
distribution:
distribution: '"rhel"'
version: "9.3"
eventLogger: journald
freeLocks: 2048
hostname: avalon.kemo.labs
idMappings:
gidmap: null
uidmap: null
kernel: 5.14.0-362.18.1.el9_3.aarch64
linkmode: dynamic
logDriver: journald
memFree: 121339949056
memTotal: 133915746304
networkBackend: netavark
networkBackendInfo:
backend: netavark
dns:
package: aardvark-dns-1.7.0-1.el9.aarch64
path: /usr/libexec/podman/aardvark-dns
version: aardvark-dns 1.7.0
package: netavark-1.7.0-2.el9_3.aarch64
path: /usr/libexec/podman/netavark
version: netavark 1.7.0
ociRuntime:
name: crun
package: crun-1.8.7-1.el9.aarch64
path: /usr/bin/crun
version: |-
crun version 1.8.7
commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
rundir: /run/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
pasta:
executable: /bin/pasta
package: passt-0^20230818.g0af928e-4.el9.aarch64
version: |
pasta 0^20230818.g0af928e-4.el9.aarch64
Copyright Red Hat
GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
remoteSocket:
exists: true
path: /run/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: false
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: true
serviceIsRemote: false
slirp4netns:
executable: /bin/slirp4netns
package: slirp4netns-1.2.1-1.el9.aarch64
version: |-
slirp4netns version 1.2.1
commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
libslirp: 4.4.0
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.5.2
swapFree: 4294963200
swapTotal: 4294963200
uptime: 105h 12m 27.00s (Approximately 4.38 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- registry.access.redhat.com
- registry.redhat.io
- docker.io
store:
configFile: /etc/containers/storage.conf
containerStore:
number: 0
paused: 0
running: 0
stopped: 0
graphDriverName: overlay
graphOptions:
overlay.mountopt: nodev,metacopy=on
graphRoot: /var/lib/containers/storage
graphRootAllocated: 1993421922304
graphRootUsed: 28735803392
graphStatus:
Backing Filesystem: xfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "true"
imageCopyTmpDir: /var/tmp
imageStore:
number: 4
runRoot: /run/containers/storage
transientStore: false
volumePath: /var/lib/containers/storage/volumes
version:
APIVersion: 4.6.1
Built: 1705652546
BuiltTime: Fri Jan 19 03:22:26 2024
GitCommit: ""
GoVersion: go1.20.12
Os: linux
OsArch: linux/arm64
Version: 4.6.1
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
No
Additional environment details
Running RHEL 9.3 on an Ampere Altra system; the same error occurs on an x86 system.
Additional information
Looks like this also affects buildah: https://github.com/containers/buildah/issues/5432 https://github.com/containers/buildah/pull/5443
A friendly reminder that this issue had no activity for 30 days.
Same here!! We need to access GPUs for some builds, not only when running the container.
@nalind PTAL
This should work as of 1.36, which includes #5443 and #5494.
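A minimal way to retest once that version is available (assuming the same Containerfile from the original report):
# Confirm the buildah version includes the fix
buildah --version
# Retry the CDI label during a build
buildah build --device nvidia.com/gpu=all --security-opt=label=disable -t darknet -f Containerfile.ubnt22 .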
I'm still experiencing a similar issue even with newer versions of buildah.
On my CI pipeline, the following command runs fine:
podman run --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
whereas
podman build --format docker --file docker/Dockerfile --device nvidia.com/gpu=all
with the following Dockerfile fails
FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
USER root
RUN df -h
RUN nvidia-smi
with the following output:
time="2025-01-24T10:56:15Z" level=warning msg="Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning."
STEP 1/4: FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
STEP 2/4: USER root
--> a85047a34ada
STEP 3/4: RUN df -h
time="2025-01-24T10:56:16Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/usr/share/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
time="2025-01-24T10:56:16Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/etc/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
Filesystem Size Used Avail Use% Mounted on
fuse-overlayfs 1.8T 219G 1.6T 13% /
tmpfs 64M 0 64M 0% /dev
/dev/md127 1.8T 219G 1.6T 13% /dev/termination-log
shm 64M 84K 64M 1% /dev/shm
tmpfs 504G 12K 504G 1% /proc/driver/nvidia
overlay 1.8T 219G 1.6T 13% /proc/acpi
tmpfs 504G 0 504G 0% /sys/fs/cgroup
--> 6f29b1abb0b9
STEP 4/4: RUN nvidia-smi
time="2025-01-24T10:56:19Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/usr/share/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
time="2025-01-24T10:56:19Z" level=warning msg="Implicit hook directories are deprecated; set --hooks-dir=\"/etc/containers/oci/hooks.d\" explicitly to continue to load ociHooks from this directory"
/bin/sh: 1: nvidia-smi: not found
subprocess exited with status 127
subprocess exited with status 127
Error: building at STEP "RUN nvidia-smi": exit status 127
I'm using the following versions:
$ buildah -v
buildah version 1.38.0 (image-spec 1.1.0, runtime-spec 1.2.0)
$ podman -v
podman version 5.3.1
So in comparison to the issue mentioned above, this is more of a "silent" failure, since the build executor is created without complaining about the device.
Any hints on how to fix this? Thanks a lot in advance!
PS:
Here's also the output from nvidia-ctk:
$ nvidia-ctk cdi list
time="2025-01-24T10:49:36Z" level=info msg="Found 17 CDI devices"
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=2
nvidia.com/gpu=3
nvidia.com/gpu=4
nvidia.com/gpu=5
nvidia.com/gpu=6
nvidia.com/gpu=7
nvidia.com/gpu=GPU-12efe259-604a-6c44-c58c-4178d4c35d3e
nvidia.com/gpu=GPU-133c740e-bad1-bde5-325d-4a49eec5dfae
nvidia.com/gpu=GPU-24db9a46-a825-5e6b-2950-07d51fb79aed
nvidia.com/gpu=GPU-4bd7a46e-8df1-3a65-5038-7b3a4baec73c
nvidia.com/gpu=GPU-8ed47b3f-a71e-71c7-18a6-bd37bf1cde8a
nvidia.com/gpu=GPU-c06d0cc5-b033-4cb4-977c-4907b6f50f5e
nvidia.com/gpu=GPU-d8afe118-eb71-e279-b65c-bc4d1640c63a
nvidia.com/gpu=GPU-dd122b28-6236-854d-b42f-6bd45143d55b
nvidia.com/gpu=all
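As a rough diagnostic for the silent case, a probe build (the Containerfile name below is just an example) that lists the CDI-injected files during a RUN step can help narrow down where the device edits are being dropped:
# Minimal probe: check for the device nodes and the injected nvidia-smi during the build
cat > Containerfile.probe <<'EOF'
FROM docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
RUN ls -l /dev/nvidia* /usr/bin/nvidia-smi || true
EOF
podman build --device nvidia.com/gpu=all --file Containerfile.probe .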