Can't convert compose service with CDI device

Open rany2 opened this issue 1 year ago • 11 comments

Consider the following service:

  jellyfin:
    image: docker.io/jellyfin/jellyfin:latest
    container_name: jellyfin
    restart: unless-stopped
    #user: 973:973  # media:media
    group_add:
      - video
    ports:
      - 127.0.0.1:8096:8096
    volumes:
      - ./jellyfin/config:/config
      - ./jellyfin/cache:/cache
      - /mnt/hdd/media:/data/media
    devices:
      - nvidia.com/gpu=all
    security_opt:
      - label=disable

Ignore the fact that the user entry would fail with podlet due to https://github.com/containers/podlet/issues/106. A separate validation failure is triggered by the devices entry:

Error: 
   0: error converting compose file
   1: error reading compose file
   2: File `/compose.yml` is not a valid compose file
   3: services.jellyfin.devices[0]: device must have a container path at line 45 column 9

Location:
   src/cli/compose.rs:203

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

rany2 avatar Sep 15 '24 06:09 rany2

For anyone facing this issue, the following workaround seems to work.

Define a new runtime at /etc/containers/containers.conf.d/50-nvidia-runtime.conf:

[engine.runtimes]
nvidia = ["/usr/bin/nvidia-container-runtime"]

Use runtime: nvidia in the compose service instead of the CDI device.

  jellyfin:
    image: docker.io/jellyfin/jellyfin:latest
    container_name: jellyfin
    restart: always
    #user: 973:973  # media:media
    runtime: nvidia
    group_add:
      - video
    ports:
      - 127.0.0.1:8096:8096
    volumes:
      - ./jellyfin/config:/config
      - ./jellyfin/cache:/cache
      - /mnt/hdd/media:/data/media
    security_opt:
      - label=disable
    labels:
      - io.containers.autoupdate=registry

I haven't tested the generated Quadlet service, but podlet returns the following, which seems correct (ignore the volume paths; I didn't pass --absolute-host-paths):

# jellyfin.container
[Container]
AutoUpdate=registry
ContainerName=jellyfin
Image=docker.io/jellyfin/jellyfin:latest
PodmanArgs=--group-add video
PublishPort=127.0.0.1:8096:8096
SecurityLabelDisable=true
Volume=./jellyfin/config:/config
Volume=./jellyfin/cache:/cache
Volume=/mnt/hdd/media:/data/media
GlobalArgs=--runtime nvidia

[Service]
Restart=always

rany2 avatar Sep 16 '24 14:09 rany2

According to the Compose Specification, devices must be in the form HOST_PATH:CONTAINER_PATH[:CGROUP_PERMISSIONS].
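To illustrate why the CDI entry is rejected, here is a minimal sketch of a strict parser for that short syntax. It is not Podlet's or compose_spec's actual code; the function name `parse_device` and the returned dictionary layout are hypothetical, and only the error wording is borrowed from the message quoted earlier in the thread.

```python
def parse_device(entry: str) -> dict:
    """Strictly parse HOST_PATH:CONTAINER_PATH[:CGROUP_PERMISSIONS].

    Hypothetical helper for illustration only; not taken from Podlet.
    """
    parts = entry.split(":")
    # A CDI name such as "nvidia.com/gpu=all" has no ":" at all,
    # so there is no second field to use as the container path.
    if len(parts) < 2 or not parts[1]:
        raise ValueError("device must have a container path")
    if len(parts) > 3:
        raise ValueError("too many ':'-separated fields")
    permissions = parts[2] if len(parts) == 3 else None
    # Permissions, when given, are a non-empty combination of r, w and m.
    if permissions is not None and (not permissions or set(permissions) - set("rwm")):
        raise ValueError("permissions must be a combination of 'r', 'w', 'm'")
    return {
        "host_path": parts[0],
        "container_path": parts[1],
        "permissions": permissions,
    }
```

Under this reading of the spec, `/dev/ttyUSB0:/dev/ttyUSB0:rw` parses, while `nvidia.com/gpu=all` fails on the container-path check.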

k9withabone avatar Sep 21 '24 06:09 k9withabone

Specifically for Podman, there is podman run --gpus (added in Podman v5.0.0), so you could add PodmanArgs=--gpus all to the generated .container Quadlet file.

k9withabone avatar Sep 21 '24 06:09 k9withabone

According to the Compose Specification, devices must be in the form HOST_PATH:CONTAINER_PATH[:CGROUP_PERMISSIONS].

Shouldn't the spec be corrected given that CDI devices exist? I think CDI devices are a relatively recent standard (not older than 5 years) and it's only very recently that Nvidia started recommending it for Podman users. It seems like a case of the spec being out of date.

Docker also supports CDI devices but I'm not sure if their docker-compose is doing this same type of validation.

IMO it should be valid given that both podman run and docker run accept it as valid.

rany2 avatar Sep 21 '24 09:09 rany2

Specifically for Podman, there is podman run --gpus (added in Podman v5.0.0), so you could add PodmanArgs=--gpus all to the generated .container Quadlet file.

I actually preferred the runtime approach, as it doesn't require me to create some kind of package update hook or systemd service that keeps the CDI YAML file up to date. The issue with CDI is that the file needs to be updated every time CUDA or the NVIDIA driver is updated.

Either way, this issue doesn't impact me anymore but I kept the issue open as it seems a simple issue to fix. Someone might need CDI devices for some other vendor and wouldn't be able to use the runtime workaround.

(Edit: --gpus=all just adds the Nvidia CDI devices behind the scenes. https://github.com/containers/podman/pull/21180)
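As a rough sketch of what that edit describes, the mapping from `--gpus` values to CDI device names could look like the following. This is my reading of the linked PR, not Podman's actual code (Podman is written in Go), and the function name `gpus_to_cdi_devices` is hypothetical.

```python
def gpus_to_cdi_devices(gpus: list[str]) -> list[str]:
    """Map --gpus values to NVIDIA CDI device names.

    Assumption: each value is appended to the "nvidia.com/gpu=" prefix,
    which is how the thread describes the behavior of --gpus=all.
    """
    return [f"nvidia.com/gpu={gpu}" for gpu in gpus]


def device_args(gpus: list[str]) -> list[str]:
    """Build the equivalent --device arguments for the given --gpus values."""
    args: list[str] = []
    for device in gpus_to_cdi_devices(gpus):
        args += ["--device", device]
    return args
```

If that assumption holds, `--gpus all` and `--device nvidia.com/gpu=all` would end up requesting the same CDI device, so both depend on the CDI file being present and current.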

rany2 avatar Sep 21 '24 09:09 rany2

Thanks for the information! I haven't tried to use a GPU in a container myself and hadn't heard of CDI before.

Shouldn't the spec be corrected given that CDI devices exist?

Probably. You should create an issue in the compose-spec repo since you understand this better than I do.

IMO it should be valid given that both podman run and docker run accept it as valid.

Is there documentation on this? I can't find anything about CDI in the docker-run(1) or podman-run(1) man pages.

k9withabone avatar Sep 21 '24 23:09 k9withabone

Is there documentation on this? I can't find anything about CDI in the docker-run(1) or podman-run(1) man pages.

In the podman-run man page, the reference to CDI devices is subtle:

--device=host-device[:container-device][:permissions]

With CDI devices, container-device and permissions need to be omitted. It is strange that it isn't mentioned more directly, though.

rany2 avatar Sep 22 '24 07:09 rany2

I made a ticket here: https://github.com/compose-spec/compose-spec/issues/532

rany2 avatar Sep 22 '24 10:09 rany2

In the podman-run man page, the reference to CDI devices is subtle:

--device=host-device[:container-device][:permissions]

With CDI devices, container-device and permissions need to be omitted. It is strange that it isn't mentioned more directly, though.

Are you sure that's a reference to CDI devices? Leaving off the container-device instructs Podman to mount the device in the same place in the container as the host.
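The two cases can be told apart by shape: a host device is an absolute path, while a CDI device is a name of the form vendor/class=name. The sketch below shows one way to classify an entry. The regular expression is a simplified approximation of the CDI naming rules, not the normative grammar, and `classify_device` is a hypothetical helper, not part of Podman or Podlet.

```python
import re

# Simplified approximation of a fully-qualified CDI device name
# (vendor/class=name). The real CDI spec is stricter about which
# characters may start and end each component.
CDI_NAME = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9_.-]*"   # vendor, e.g. nvidia.com
    r"/[A-Za-z0-9][A-Za-z0-9_-]*"    # class, e.g. gpu
    r"=[A-Za-z0-9][A-Za-z0-9_.:-]*$" # device name, e.g. all
)


def classify_device(entry: str) -> str:
    """Return "path", "cdi", or "unknown" for a --device style entry."""
    if entry.startswith("/"):
        # Host path, optionally followed by :container-device[:permissions].
        return "path"
    if CDI_NAME.match(entry):
        return "cdi"
    return "unknown"
```

With this check, `/dev/dri` (host path with the container path left off) and `nvidia.com/gpu=all` (CDI name) fall into different categories even though neither contains a colon.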

I get that Podman and Docker do support CDI devices. I'm just hesitant to add it to Podlet / compose_spec without clear documentation to reference.

k9withabone avatar Sep 23 '24 01:09 k9withabone

It's actually not. I checked the man page's git history, and this text predates CDI.

rany2 avatar Sep 23 '24 04:09 rany2

@k9withabone The spec was recently updated to accommodate CDI: https://github.com/compose-spec/compose-spec/pull/574

rany2 avatar May 06 '25 08:05 rany2