
FFmpeg NVENC fails in pods unless `/dev/nvidia#` index matches GPU index from `nvidia-smi` (with `deviceListStrategy: volume-mounts`)


๐Ÿ› Describe the bug

When deploying GPU-bound pods using the NVIDIA device plugin (nvidia-device-plugin Helm chart v0.17.1), FFmpeg NVENC fails inside the container unless the assigned GPU is mounted at the path /dev/nvidiaN where N matches its index in nvidia-smi.

This issue occurs only when using deviceListStrategy: volume-mounts, which is required for secure GPU isolation in our multi-tenant environment. Using envvar is not an option, as users can override NVIDIA_VISIBLE_DEVICES in untrusted Docker images.

As a result, only pods where the assigned GPU's nvidia-smi index matches the container path /dev/nvidiaN succeed. All others fail with unsupported device errors in FFmpeg.
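
To make the `envvar` concern above concrete: a minimal sketch, assuming a hypothetical untrusted tenant image (the CUDA base tag is only illustrative):

    # Hypothetical Dockerfile for an untrusted image. With deviceListStrategy: envvar,
    # the runtime can honor this baked-in value and expose GPUs the kubelet never
    # allocated to the pod; volume-mounts is used to close exactly this hole.
    cat > Dockerfile <<'EOF'
    FROM nvidia/cuda:12.4.1-base-ubuntu22.04
    ENV NVIDIA_VISIBLE_DEVICES=all
    EOF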


๐Ÿ› ๏ธ Helm values

deviceIDStrategy: uuid
deviceListStrategy: volume-mounts
runtimeClassName: nvidia
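
For reference, a sketch of how these values might be applied via the chart's documented install command (the `nvdp` repo alias and namespace are the chart's defaults, not taken from this report):

    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
      --namespace nvidia-device-plugin --create-namespace \
      --version 0.17.1 \
      --set deviceIDStrategy=uuid \
      --set deviceListStrategy=volume-mounts \
      --set runtimeClassName=nvidia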

🧠 Root cause

NVENC appears to rely on the assumption that:

/dev/nvidiaN <-> GPU with index N from `nvidia-smi`

If this alignment is broken (e.g. the GPU with index 0 is mounted as /dev/nvidia5), the encoder fails:

[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found

This behavior is reproducible and consistent across all tested environments.
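
A host-side sketch of the check behind this claim, correlating the `nvidia-smi` index/UUID with the driver's `Device Minor` (which is what the /dev/nvidiaN node number encodes); paths assume the standard driver procfs layout:

    # Index and UUID as nvidia-smi reports them
    nvidia-smi --query-gpu=index,gpu_uuid --format=csv,noheader
    # UUID and minor number as the kernel driver reports them
    for f in /proc/driver/nvidia/gpus/*/information; do
      echo "== $f"
      grep -E 'GPU UUID|Device Minor' "$f"
    done
    # Device nodes on the host
    ls -l /dev/nvidia[0-9]*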


๐Ÿ–ฅ๏ธ Host configuration

  • 6× NVIDIA RTX 4090 (UUID-assigned, known-good hardware)
  • Host /dev/nvidia[0-5] layout matches nvidia-smi output
  • nvidia-smi, CUDA, and NVENC work fine directly on the host
  • The issue only occurs inside containers when the mount path index diverges from the nvidia-smi index

✅ Working pod example

  • GPU UUID: GPU-46b5dd79-...
  • nvidia-smi index (inside the pod): 0
  • Mounted as: /dev/nvidia0
  • ✅ ffmpeg -c:v h264_nvenc works

โŒ Failing pod example

  • GPU UUID: GPU-dada647b-...
  • nvidia-smi index (inside the pod): 0
  • Mounted as: /dev/nvidia5
  • โŒ ffmpeg -c:v h264_nvenc fails with:
[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found

๐Ÿ” Additional observations

  • All expected character devices (nvidia[0-9], nvidiactl, uvm, etc.) are present inside the pod.
  • The mounted /dev/nvidiaX files have correct major/minor numbers.
  • The issue only depends on the alignment between nvidia-smi index and the mounted path.
  • The `Device Minor:` field in /proc/driver/nvidia/gpus/.../information does not determine NVENC success; only the mount path does (see the sketch below).
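
A sketch of the in-pod commands behind these observations (standard driver paths assumed):

    # Mounted device nodes and their major/minor numbers, as seen inside the pod
    stat -c '%n  major=0x%t minor=0x%T' /dev/nvidia[0-9]*
    # The driver's Device Minor for each GPU (procfs is shared from the host)
    grep -H 'Device Minor' /proc/driver/nvidia/gpus/*/information
    # Index/UUID as the CUDA/NVENC stack enumerates them inside the pod
    nvidia-smi --query-gpu=index,gpu_uuid --format=csv,noheader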

✅ Expected behavior

All GPUs assigned to a container should be fully usable via NVENC, regardless of physical or logical index, as long as the device is properly mounted.

The device plugin should ensure that /dev/nvidiaN always maps to the GPU with nvidia-smi index N, or NVENC workloads will fail.


🌎 Environment

  • Host OS: Ubuntu 22.04
  • GPUs: 6× NVIDIA RTX 4090
  • Container runtime: containerd
  • Kubernetes: v1.32.x (K3s)
  • NVIDIA Driver: 570.133.20 (also tested with 575)
  • NVIDIA device plugin: v0.17.1 (Helm)
  • nvidia-container-runtime: 3.14.0-1
  • nvidia-container-toolkit: 1.17.6-1
  • NVIDIA_DRIVER_CAPABILITIES: compute,video,utility,graphics,display (set in the deployment image)
  • FFmpeg: NVENC-enabled build (confirmed working directly on host)

🧪 Steps to reproduce

  1. Deploy multiple pods with:

    resources:
      limits:
        nvidia.com/gpu: 1
    
  2. Inside each pod, run:

    nvidia-smi --query-gpu=gpu_uuid,index,name --format=csv,noheader
    ls -l /dev/nvidia[0-9]
    ffmpeg -hide_banner -f lavfi -i testsrc=duration=3:size=1280x720:rate=30 -c:v h264_nvenc -y /tmp/test.mp4
    
  3. Observe:

    • If /dev/nvidiaN matches the index N reported by nvidia-smi, encoding works.
    • If not, FFmpeg fails.
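
For convenience, a self-contained repro pod as a sketch; the image tag, pod name, and explicit runtimeClassName are assumptions, and any NVENC-enabled FFmpeg image should show the same behavior:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: nvenc-repro
    spec:
      runtimeClassName: nvidia
      restartPolicy: Never
      containers:
      - name: ffmpeg
        image: jrottenberg/ffmpeg:4.4-nvidia   # assumption: any NVENC-enabled build works
        command: ["sleep", "infinity"]
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES    # mirrors what the report sets in its image
          value: compute,video,utility,graphics,display
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF

The step-2 commands can then be run with kubectl exec -it nvenc-repro -- bash.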

💡 Suggested improvement

Ensure the device plugin mounts GPU devices inside the pod at the /dev/nvidiaN path where N is the GPU's index reported by nvidia-smi.

This will restore NVENC compatibility and likely benefit other workloads that rely on this path/index alignment.


🚫 Partial workaround

None identified.

Detecting the mismatch inside user space (via nvidia-smi + ls -l /dev/nvidia*) lets us fail fast, but does not resolve the root problem: NVENC will still fail to initialize.
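
For completeness, the fail-fast detection described above might look like the following sketch; it only surfaces the mismatch and, as noted, does not make NVENC initialize:

    #!/bin/sh
    # Compare the /dev/nvidiaN nodes mounted into the pod with the GPU indices
    # nvidia-smi reports, and fail fast on any mismatch.
    visible=$(ls /dev/nvidia[0-9]* 2>/dev/null | sed 's#^/dev/nvidia##' | sort -n)
    reported=$(nvidia-smi --query-gpu=index --format=csv,noheader | sort -n)
    if [ "$visible" != "$reported" ]; then
      echo "GPU path/index mismatch: /dev has [$visible], nvidia-smi reports [$reported]" >&2
      exit 1
    fi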
