
FFmpeg NVENC fails in pods unless `/dev/nvidia#` index matches GPU index from `nvidia-smi` (with `deviceListStrategy: volume-mounts`)


๐Ÿ› Describe the bug

When deploying GPU-bound pods using the NVIDIA device plugin (nvidia-device-plugin Helm chart v0.17.1), FFmpeg NVENC fails inside the container unless the assigned GPU is mounted at the path /dev/nvidiaN where N matches its index in nvidia-smi.

This issue occurs only when using deviceListStrategy: volume-mounts, which is required for secure GPU isolation in our multi-tenant environment. Using envvar is not an option, as users can override NVIDIA_VISIBLE_DEVICES in untrusted Docker images.

As a result, only pods where the assigned GPU's nvidia-smi index matches the container path /dev/nvidiaN succeed. All others fail with unsupported device errors in FFmpeg.
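
To make the `envvar` concern above concrete: a minimal sketch, assuming a hypothetical untrusted tenant image (the CUDA base tag is only illustrative):

    # Hypothetical Dockerfile for an untrusted image. With deviceListStrategy: envvar,
    # the runtime can honor this baked-in value and expose GPUs the kubelet never
    # allocated to the pod; volume-mounts is used to close exactly this hole.
    cat > Dockerfile <<'EOF'
    FROM nvidia/cuda:12.4.1-base-ubuntu22.04
    ENV NVIDIA_VISIBLE_DEVICES=all
    EOF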


๐Ÿ› ๏ธ Helm values

deviceIDStrategy: uuid
deviceListStrategy: volume-mounts
runtimeClassName: nvidia
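
For reference, a sketch of how these values might be applied via the chart's documented install command (the `nvdp` repo alias and namespace are the chart's defaults, not taken from this report):

    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
      --namespace nvidia-device-plugin --create-namespace \
      --version 0.17.1 \
      --set deviceIDStrategy=uuid \
      --set deviceListStrategy=volume-mounts \
      --set runtimeClassName=nvidia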

🧠 Root cause

NVENC appears to rely on the assumption that:

/dev/nvidiaN <-> GPU with index N from `nvidia-smi`

If this alignment is broken (e.g. the GPU with index 0 is mounted as /dev/nvidia5), the encoder fails:

[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found

This behavior is reproducible and consistent across all tested environments.
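
A host-side sketch of the check behind this claim, correlating the `nvidia-smi` index/UUID with the driver's `Device Minor` (which is what the /dev/nvidiaN node number encodes); paths assume the standard driver procfs layout:

    # Index and UUID as nvidia-smi reports them
    nvidia-smi --query-gpu=index,gpu_uuid --format=csv,noheader
    # UUID and minor number as the kernel driver reports them
    for f in /proc/driver/nvidia/gpus/*/information; do
      echo "== $f"
      grep -E 'GPU UUID|Device Minor' "$f"
    done
    # Device nodes on the host
    ls -l /dev/nvidia[0-9]*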


๐Ÿ–ฅ๏ธ Host configuration

  • 6× NVIDIA RTX 4090 (UUID-assigned, known-good hardware)
  • Host /dev/nvidia[0-5] layout matches nvidia-smi output
  • nvidia-smi, CUDA, and NVENC work fine directly on the host
  • The issue only occurs inside containers when the mount path index diverges from the nvidia-smi index

✅ Working pod example

  • GPU UUID: GPU-46b5dd79-...
  • nvidia-smi index (inside the pod): 0
  • Mounted as: /dev/nvidia0
  • ✅ ffmpeg -c:v h264_nvenc works

โŒ Failing pod example

  • GPU UUID: GPU-dada647b-...
  • nvidia-smi index (inside the pod): 0
  • Mounted as: /dev/nvidia5
  • โŒ ffmpeg -c:v h264_nvenc fails with:
[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found

๐Ÿ” Additional observations

  • All expected character devices (nvidia[0-9], nvidiactl, uvm, etc.) are present inside the pod.
  • The mounted /dev/nvidiaX files have correct major/minor numbers.
  • The issue only depends on the alignment between nvidia-smi index and the mounted path.
  • The `Device Minor:` field in /proc/driver/nvidia/gpus/.../information does not determine NVENC success; only the mount path does (see the sketch below).
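
A sketch of the in-pod commands behind these observations (standard driver paths assumed):

    # Mounted device nodes and their major/minor numbers, as seen inside the pod
    stat -c '%n  major=0x%t minor=0x%T' /dev/nvidia[0-9]*
    # The driver's Device Minor for each GPU (procfs is shared from the host)
    grep -H 'Device Minor' /proc/driver/nvidia/gpus/*/information
    # Index/UUID as the CUDA/NVENC stack enumerates them inside the pod
    nvidia-smi --query-gpu=index,gpu_uuid --format=csv,noheader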

✅ Expected behavior

All GPUs assigned to a container should be fully usable via NVENC, regardless of physical or logical index, as long as the device is properly mounted.

The device plugin should ensure that /dev/nvidiaN always maps to the GPU with nvidia-smi index N, or NVENC workloads will fail.


🌎 Environment

  • Host OS: Ubuntu 22.04
  • GPUs: 6× NVIDIA RTX 4090
  • Container runtime: containerd
  • Kubernetes: v1.32.x (K3s)
  • NVIDIA Driver: 570.133.20 (also tested with 575)
  • NVIDIA device plugin: v0.17.1 (Helm)
  • nvidia-container-runtime: 3.14.0-1
  • nvidia-container-toolkit: 1.17.6-1
  • NVIDIA_DRIVER_CAPABILITIES: compute,video,utility,graphics,display (set in the deployment image)
  • FFmpeg: NVENC-enabled build (confirmed working directly on host)

🧪 Steps to reproduce

  1. Deploy multiple pods with:

    resources:
      limits:
        nvidia.com/gpu: 1
    
  2. Inside each pod, run:

    nvidia-smi --query-gpu=gpu_uuid,index,name --format=csv,noheader
    ls -l /dev/nvidia[0-9]
    ffmpeg -hide_banner -f lavfi -i testsrc=duration=3:size=1280x720:rate=30 -c:v h264_nvenc -y /tmp/test.mp4
    
  3. Observe:

    • If /dev/nvidiaN matches the index N reported by nvidia-smi, encoding works.
    • If not, FFmpeg fails.
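
For convenience, a self-contained repro pod as a sketch; the image tag, pod name, and explicit runtimeClassName are assumptions, and any NVENC-enabled FFmpeg image should show the same behavior:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: nvenc-repro
    spec:
      runtimeClassName: nvidia
      restartPolicy: Never
      containers:
      - name: ffmpeg
        image: jrottenberg/ffmpeg:4.4-nvidia   # assumption: any NVENC-enabled build works
        command: ["sleep", "infinity"]
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES    # mirrors what the report sets in its image
          value: compute,video,utility,graphics,display
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF

The step-2 commands can then be run with kubectl exec -it nvenc-repro -- bash.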

💡 Suggested improvement

Ensure the device plugin mounts GPU devices inside the pod at the /dev/nvidiaN path where N is the GPU's index reported by nvidia-smi.

This will restore NVENC compatibility and likely benefit other workloads that rely on this path/index alignment.


🚫 Partial workaround

None identified.

Detecting the mismatch inside user space (via nvidia-smi + ls -l /dev/nvidia*) lets us fail fast, but does not resolve the root problem: NVENC will still fail to initialize.
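
For completeness, the fail-fast detection described above might look like the following sketch; it only surfaces the mismatch and, as noted, does not make NVENC initialize:

    #!/bin/sh
    # Compare the /dev/nvidiaN nodes mounted into the pod with the GPU indices
    # nvidia-smi reports, and fail fast on any mismatch.
    visible=$(ls /dev/nvidia[0-9]* 2>/dev/null | sed 's#^/dev/nvidia##' | sort -n)
    reported=$(nvidia-smi --query-gpu=index --format=csv,noheader | sort -n)
    if [ "$visible" != "$reported" ]; then
      echo "GPU path/index mismatch: /dev has [$visible], nvidia-smi reports [$reported]" >&2
      exit 1
    fi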
