gpu-operator
Problems running the GPU operator on k3s
Our product is IBM Edge Application Manager (IEAM). It manages containerized workloads on small devices, such as the NVIDIA Jetson Nano and TX2, and on Kubernetes clusters; among others, we support OCP and k3s clusters for IEAM. We would like to support NVIDIA GPUs on all of our supported platforms, but we are currently running into problems with the NVIDIA GPU operator on k3s.
- Hardware: 96-core Xeon, 200 GB RAM
- OS/Distro: Ubuntu 18.04.4, Linux kernel 5.3
- Docker: 19.03.12
- k3s: fresh install of the latest release
We followed the installation instructions (https://github.com/NVIDIA/gpu-operator#install-helm), but 3 of the pods do not come up. Note that, as per the prerequisites, we invoked the helm install as follows, since I saw that NFD pods already existed:
```
sudo helm install --devel --set nfd.enabled=false nvidia/gpu-operator --wait --generate-name
```
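For reference, here is a sketch of how we confirmed that NFD pods already existed (the grep pattern assumes the default node-feature-discovery pod naming):

```
# List all pods and filter for node-feature-discovery (NFD); if matches
# appear, the operator's bundled NFD should be disabled as above.
kubectl get pods --all-namespaces | grep -i node-feature-discovery
```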
The logs of the two pods stuck in the CrashLoopBackOff state show errors; see the attached log file.
Attached are the pod descriptions for the three pods.
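For reference, a sketch of the commands used to collect the diagnostics attached below (the namespace is an assumption based on the 1.x operator's default gpu-operator-resources namespace; substitute the pod names from your own cluster):

```
# Snapshot of all pods across namespaces.
kubectl get pods --all-namespaces > pods.txt

# Description and logs for one of the failing pods.
kubectl -n gpu-operator-resources describe pod nvidia-dcgm-exporter-48rqg
kubectl -n gpu-operator-resources logs nvidia-dcgm-exporter-48rqg > log-nvidia-dcgm-exporter-48rqg.txt
```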
This seems to be a similar issue to https://access.redhat.com/solutions/5089121.
Here is our platform configuration info; please see the two attached files:
- pods.txt
- log-nvidia-dcgm-exporter-48rqg.txt
We received some help on this from Anurag Guda and Anudeep Nallamothu, but we remain blocked, and they suggested that I raise a GitHub issue for this.
We're excited to see you trying the GPU operator. Unfortunately, we do not support k3s yet, but we do have it in our future plans.
Looking at the error messages, we can see problems with the driver setup or the container runtime. We will provide updates as soon as we debug these problems and add k3s support.
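In the meantime, a couple of checks that may help narrow things down (a sketch; the namespace and label selector are assumptions based on the operator's default naming):

```
# k3s uses its own embedded containerd rather than Docker by default, which
# is one place runtime problems can originate; the CONTAINER-RUNTIME column
# shows what the kubelet is actually using.
kubectl get nodes -o wide

# The driver daemonset logs usually show why the driver setup failed.
kubectl -n gpu-operator-resources logs -l app=nvidia-driver-daemonset --all-containers
```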
@nvjmayo any update on this? It has been quite a while :)
+1
@ElisaMeng @corbanvilla Can you try with v1.7.0 and verify whether you are still seeing this issue?
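A minimal sketch of pinning the install to that release (the v-prefixed version string and the nfd.enabled=false flag are assumptions mirroring NVIDIA's chart versioning and the original report; adjust as needed):

```
# Refresh the local chart index, then install chart version v1.7.0.
helm repo update
sudo helm install --wait --generate-name \
  --set nfd.enabled=false \
  --version v1.7.0 \
  nvidia/gpu-operator
```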