Problems running the GPU operator on k3s

Open TheMosquito opened this issue 4 years ago • 4 comments

Our product is IBM Edge Application Manager (IEAM). It manages containerized workloads on small devices, like the NVIDIA Nano and TX2, as well as on Kubernetes clusters. We support OCP and k3s Kubernetes clusters (among others) for IEAM, and we would like to support NVIDIA GPUs on all of our supported platforms. However, we are running into problems trying to run the NVIDIA GPU operator on k3s.

Hardware: 96-core Xeon, 200 GB RAM
OS/Distro: Linux 5.3, Ubuntu 18.04.04
Docker 19.03.12
Fresh install of the latest k3s

We followed the installation instructions (https://github.com/NVIDIA/gpu-operator#install-helm), but 3 of the pods do not come up. Note that, as per the prerequisites, because I saw that nfd pods already existed, we invoked the helm install as follows: sudo helm install --devel --set nfd.enabled=false nvidia/gpu-operator --wait --generate-name

image
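For anyone reproducing this, the pods that did not come up can also be captured as text instead of a screenshot. The following is a minimal sketch, not part of the original report: it assumes `kubectl` is on the PATH (a default k3s install normally provides it) and that `kubectl get pods --all-namespaces --no-headers` prints the usual NAMESPACE, NAME, READY, STATUS columns. The helper names `failing_pods` and `list_pods` are hypothetical.

```python
import subprocess
from typing import List, Tuple

# Statuses treated as healthy; anything else (e.g. CrashLoopBackOff) is reported.
HEALTHY = {"Running", "Completed"}


def failing_pods(listing: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, status) for every pod that is not healthy.

    `listing` is assumed to be the text printed by
    `kubectl get pods --all-namespaces --no-headers`.
    """
    failing = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        namespace, name, _ready, status = fields[:4]
        if status not in HEALTHY:
            failing.append((namespace, name, status))
    return failing


def list_pods() -> str:
    """Run kubectl and return its raw pod listing (assumes kubectl is installed)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "--no-headers"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `failing_pods(list_pods())` on the cluster returns the pods worth inspecting further with `kubectl describe pod` and `kubectl logs`.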

The logs of the two pods that are in CrashLoopBackOff show:

image

Attached are the pod descriptions for the three pods.

This seems to be a similar issue to https://access.redhat.com/solutions/5089121.

Here is our platform configuration info:

image

(Please see the 2 attached files: pods.txt and log-nvidia-dcgm-exporter-48rqg.txt.)

We received some help on this from Anurag Guda and Anudeep Nallamothu, but we remain blocked, and they suggested I raise a GitHub issue for this.

TheMosquito, Jul 22 '20 21:07

We're excited to see you trying the GPU operator. Sorry, we do not support k3s yet, though it is in our future plans.

Looking at the error messages, we can see problems with the driver setup or the runtime. We will provide updates as soon as we have debugged these problems and support k3s.

nvjmayo, Jul 27 '20 18:07

@nvjmayo any update on this? It has been quite a while :)

ElisaMeng, Dec 30 '20 18:12

+1

corbanvilla, Jan 25 '21 00:01

@ElisaMeng @corbanvilla Can you try with v1.7.0 and verify whether you are still seeing this issue?

shivamerla, May 25 '21 16:05