
EGL Initialization Failure with K8s Device Plugin

Open ryan-brigden-ai opened this issue 11 months ago • 3 comments

Overview

We're running an application that uses Nvidia graphics capabilities and are trying to get it running in K8s. With version 0.17 of the device plugin, we cannot initialize EGL, which our application relies on to access the Nvidia device.

Test cases:

  • In K8s pod with GPU (nvidia-smi works, but eglinfo fails).
  • In container created on same host with docker (nvidia-smi works and eglinfo works).

Expected behavior:

EGL should initialize and use the Nvidia GPU in the K8s pod. eglinfo should return information about the Nvidia device.

Reproduction

Pre-requisites

  • K8s cluster with Nvidia GPUs and latest Nvidia device plugin (0.17)

Steps

  1. Create a pod
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test-pod
  namespace: default
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: nvidia-test-container
      image: nvidia/opengl:1.2-glvnd-runtime
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      command:
        - /bin/bash
        - "-c"
        - "sleep infinity"
  2. Get a shell in the pod and install eglinfo.
  3. Run eglinfo. Note the output:
Device platform:
eglinfo: eglInitialize failed

We would expect the output to be:

Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
EGL extensions string:
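
To compare the pod and the docker container without reading the output by eye, a check along these lines could be used. This is only a sketch: the function names are made up for illustration, and it assumes eglinfo prints a "Device platform:" header followed by either "eglInitialize failed" or an "EGL vendor string:" line, as in the two outputs above.

```python
import shutil
import subprocess


def device_platform_lines(output):
    """Collect the lines eglinfo printed under its 'Device platform:' header."""
    lines, inside = [], False
    for raw in output.splitlines():
        line = raw.strip()
        if line.endswith("platform:"):
            # A new platform section starts; track only the device platform.
            inside = line.endswith("Device platform:")
            continue
        if inside and line:
            lines.append(line)
    return lines


def nvidia_egl_ok(output):
    """True if the device platform section reports an NVIDIA vendor string."""
    section = device_platform_lines(output)
    if any("eglInitialize failed" in line for line in section):
        return False
    return any(
        line.startswith("EGL vendor string:") and "NVIDIA" in line
        for line in section
    )


if __name__ == "__main__":
    if shutil.which("eglinfo") is None:
        print("eglinfo not found on PATH")
    else:
        result = subprocess.run(["eglinfo"], capture_output=True, text=True)
        ok = nvidia_egl_ok(result.stdout + result.stderr)
        print("EGL OK" if ok else "EGL initialization failed")
```

Running the same script in both environments gives a quick pass/fail signal; other platform sections (X11, GBM) are ignored on purpose, since they may fail in a headless container for unrelated reasons.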

My notes

I suspect the plugin does not exercise the following code path, which the nvidia-container-toolkit does exercise: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/internal/discover/graphics.go#L52
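
One way to test this hypothesis would be to check, in both the pod and the docker container, whether the files EGL typically needs are present. The sketch below does that. The paths are my assumptions about a typical glvnd-based Nvidia install on x86_64 and are not taken from graphics.go, so they may need adjusting for a given image; `missing_graphics_files` is a hypothetical helper name.

```python
import glob
import os

# Assumed candidate locations (relative to the filesystem root); not an
# authoritative list of what the toolkit injects.
CANDIDATES = {
    "GLVND EGL vendor config": [
        "usr/share/glvnd/egl_vendor.d/*nvidia*.json",
        "etc/glvnd/egl_vendor.d/*nvidia*.json",
    ],
    "NVIDIA EGL library": [
        "usr/lib/x86_64-linux-gnu/libEGL_nvidia.so*",
        "usr/lib64/libEGL_nvidia.so*",
    ],
    "DRM device nodes": [
        "dev/dri/card*",
        "dev/dri/renderD*",
    ],
}


def missing_graphics_files(root="/"):
    """Return the labels for which no candidate path exists under root."""
    missing = []
    for label, patterns in CANDIDATES.items():
        found = any(glob.glob(os.path.join(root, p)) for p in patterns)
        if not found:
            missing.append(label)
    return missing


if __name__ == "__main__":
    missing = missing_graphics_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All candidate graphics files found")
```

If the pod reports entries as missing that the docker container has, that would be consistent with the graphics discovery step not running for pods created through the plugin.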

ryan-brigden-ai, Jan 28 '25 17:01

Could you provide information on how the device plugin is configured?

What is the container runtime used on your K8s cluster?

elezar, Jan 28 '25 18:01

> Could you provide information on how the device plugin is configured?

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

> What is the container runtime used on your K8s cluster?

We have seen the issue with both containerd and cri-o. We are primarily interested in cri-o.

ryan-brigden-ai, Jan 28 '25 21:01

I am also seeing the same issue.

bassrock, Jun 05 '25 15:06