EGL Initialization Failure with K8s Device Plugin
Overview
We're running an application that uses Nvidia graphics capabilities and are trying to get it running in K8s. With version 0.17 of the device plugin, we cannot initialize EGL, which our application relies on to access the Nvidia device.
Test cases:
- In K8s pod with GPU (nvidia-smi works, but eglinfo fails).
- In container created on same host with docker (nvidia-smi works and eglinfo works).
Expected behavior:
Initialize EGL to leverage Nvidia GPU in K8s pod. eglinfo should return information about the Nvidia device.
Reproduction
Pre-requisites
- K8s cluster with Nvidia GPUs and latest Nvidia device plugin (0.17)
Steps
- Create a pod
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test-pod
  namespace: default
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: nvidia-test-container
      image: nvidia/opengl:1.2-glvnd-runtime
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      command:
        - /bin/bash
        - "-c"
        - "sleep infinity"
- Get a shell in the pod and install eglinfo.
- Run eglinfo. Notice the output:
Device platform:
eglinfo: eglInitialize failed
We would expect the output to be
Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
EGL extensions string:
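For anyone triaging this across several pods, the two outcomes above can be told apart mechanically. This is only a minimal sketch and not part of the original report: the function name classify_egl is hypothetical, and it assumes eglinfo prints exactly the strings quoted above. It reads eglinfo output on stdin and prints nvidia, failed, or unknown.

```shell
# Hypothetical helper: classify eglinfo output read from stdin.
# Assumes the success and failure strings quoted in this report.
classify_egl() {
  # Read all of stdin once, then match against the known strings.
  egl_out="$(cat)"
  case "$egl_out" in
    *"EGL vendor string: NVIDIA"*) echo nvidia ;;
    *"eglInitialize failed"*) echo failed ;;
    *) echo unknown ;;
  esac
}
```

A possible use, assuming eglinfo is already installed in the pod, would be `kubectl exec nvidia-test-pod -- eglinfo 2>&1 | classify_egl`. Note that the NVIDIA match is checked first, so output where one platform fails and another reports NVIDIA is classified as nvidia.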
My notes
I think it is likely that the plugin does not exercise this code path, whereas the nvidia-container-toolkit does: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/internal/discover/graphics.go#L52
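One way to probe this hypothesis is to compare which graphics-related files are visible in the failing pod and in the working docker container. The sketch below is an assumption-laden illustration: missing_paths is a hypothetical helper, and the candidate paths are typical locations on an NVIDIA driver install, not values taken from the toolkit source.

```shell
# Hypothetical helper: print every path from the argument list that
# does not exist under ROOT. Usage: missing_paths ROOT PATH...
missing_paths() {
  root="$1"
  shift
  for p in "$@"; do
    [ -e "${root}${p}" ] || echo "$p"
  done
}

# Assumed typical locations; adjust for your distribution and driver.
EGL_CANDIDATES="/usr/share/glvnd/egl_vendor.d/10_nvidia.json /dev/dri"
```

Running `missing_paths "" $EGL_CANDIDATES` in both environments and comparing the output would show whether the pod lacks files that the docker container has. An empty result in both would suggest the difference lies elsewhere.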
Could you provide information on how the device plugin is configured?
What is the container runtime used on your K8s cluster?
Could you provide information on how the device plugin is configured?
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
What is the container runtime used on your K8s cluster?
We have seen the issue with both containerd and cri-o. We are primarily interested in cri-o.
I am also seeing the same issue