
RHEL 8.x support for GPU Operator

Open prpaul opened this issue 3 years ago • 15 comments

Wanted to check if RHEL 8.2 is supported by GPU operator 1.9.0

If it is not supported, in which version can we expect RHEL 8.2 support, and when?

prpaul avatar Dec 06 '21 18:12 prpaul

@prpaul No, we don't support RHEL 8.x worker nodes, only CoreOS. There is no plan to support RHEL worker nodes in the short term.

shivamerla avatar Dec 06 '21 18:12 shivamerla

@shivamerla So if there is no planned support or roadmap, what is the alternative to GPU operator in the field?

Most of the production deployments we have seen run RHEL 8, so what would you suggest as the way to deploy on Kubernetes?

tusharrobin avatar Dec 08 '21 18:12 tusharrobin

@tusharrobin are you referring to RHEL worker nodes in OCP or using upstream K8s?

On OCP, you could still use the GPU Operator, but you would need to build a private driver container from here: https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 and reference it when installing the GPU Operator. Alternatively, the driver can be installed directly on the RHEL nodes, passing driver.enabled=false when installing the GPU Operator.

With upstream K8s, apart from the driver itself, you need to make sure the ubi8 variants of the images are used for the GPU Operator components when installing with Helm.

helm install gpu-operator nvidia/gpu-operator --version=1.9.0 --set operator.defaultRuntime=crio,toolkit.version=1.7.2-ubi8,dcgmExporter.version=2.3.1-2.6.0-ubi8,dcgm.version=2.3.1-ubi8,migManager.version=v0.2.0-ubi8

Also, add --set driver.enabled=false when the driver is pre-installed on each RHEL node.

However, this configuration is not officially qualified or supported by the GPU Operator.
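The flags above can be combined into one command. The following is a minimal sketch that only assembles and prints the command, assuming a pre-installed driver and the image tags quoted in this comment. The variable names are illustrative, and the chart version is written with a leading "v" because a later reply in this thread notes it should be v1.9.0.

```shell
#!/bin/sh
# Sketch only: builds the Helm command discussed above for upstream K8s on
# RHEL 8 with a pre-installed driver. This setup is not officially supported.

CHART_VERSION="v1.9.0"   # a later reply says the version needs the "v" prefix
RUNTIME="crio"           # value used in the command above; adjust to your runtime

# ubi8 image variants for the operator components, as listed in the thread.
SET_ARGS="operator.defaultRuntime=${RUNTIME}"
SET_ARGS="${SET_ARGS},toolkit.version=1.7.2-ubi8"
SET_ARGS="${SET_ARGS},dcgmExporter.version=2.3.1-2.6.0-ubi8"
SET_ARGS="${SET_ARGS},dcgm.version=2.3.1-ubi8"
SET_ARGS="${SET_ARGS},migManager.version=v0.2.0-ubi8"
# The driver is assumed to be installed directly on each RHEL node.
SET_ARGS="${SET_ARGS},driver.enabled=false"

HELM_CMD="helm install gpu-operator nvidia/gpu-operator --version=${CHART_VERSION} --set ${SET_ARGS}"

# Print the command for review instead of running it.
echo "${HELM_CMD}"
```

Printing the command first makes it easy to check the comma-separated --set list for stray spaces or line breaks before running it against a cluster.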

shivamerla avatar Dec 08 '21 19:12 shivamerla

@shivamerla Even with

helm install gpu-operator nvidia/gpu-operator --version=1.9.0 --set operator.defaultRuntime=containerd,toolkit.version=1.7.2-ubi8,dcgmExporter.version=2.3.1-2.6.0-ubi8,dcgm.version=2.3.1-ubi8,migManager.version=v0.2.0-ubi8 --set driver.enabled=false

I am still seeing

Warning FailedCreatePodSandBox 1s (x2 over 12s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:RuntimeHandler "nvidia" not supported

I see runtime class is available though

[root@priyanko-bnp-mig1 gpu-operator]# kubectl get runtimeclass
NAME     HANDLER   AGE
nvidia   nvidia    71s

This is upstream Kubernetes.

tusharrobin avatar Dec 09 '21 01:12 tusharrobin

@tusharrobin Can you show the status of all pods? The container toolkit pod has to be running for the nvidia runtime to be configured with containerd. Also, there was a typo in the version earlier: it should be v1.9.0 with helm install. Based on the command you mentioned, I am assuming the driver is pre-installed?
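One way to gather this information is sketched below. It assumes containerd reads its configuration from /etc/containerd/config.toml (the usual default, but distributions such as k3s use a different path), and the helper name has_nvidia_runtime and the CONTAINERD_CONFIG variable are made up for this example.

```shell
#!/bin/sh
# has_nvidia_runtime FILE
# Succeeds if the containerd config FILE declares a runtime named "nvidia",
# i.e. contains a section such as
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
has_nvidia_runtime() {
    grep -q 'runtimes\.nvidia' "$1" 2>/dev/null
}

# Assumed default path; override with CONTAINERD_CONFIG if yours differs.
CONFIG="${CONTAINERD_CONFIG:-/etc/containerd/config.toml}"

if has_nvidia_runtime "$CONFIG"; then
    echo "nvidia runtime found in $CONFIG"
else
    echo "nvidia runtime NOT found in $CONFIG"
fi

# List every pod so the container toolkit pod's status is visible.
# Skipped when kubectl is not installed on this machine.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get pods --all-namespaces -o wide || true
fi
```

If the RuntimeClass exists but the config file has no nvidia entry, that would be consistent with the toolkit pod not having run to completion on that node.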

shivamerla avatar Dec 09 '21 01:12 shivamerla

@shivamerla I was able to install using these options after removing the defaultRuntime option, as I was using docker. Thanks for all your help!
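For anyone unsure which runtime their nodes use, a small sketch follows. It maps the runtime string that kubelet reports for a node to one of the defaultRuntime values mentioned in this thread (docker, crio, containerd). The function name default_runtime_for and the sample version strings are hypothetical.

```shell
#!/bin/sh
# default_runtime_for RUNTIME_VERSION
# Maps a kubelet-reported runtime string, e.g. "containerd://1.5.8",
# to a value for operator.defaultRuntime. Returns non-zero if unrecognized.
default_runtime_for() {
    case "$1" in
        containerd://*) echo "containerd" ;;
        cri-o://*)      echo "crio" ;;
        docker://*)     echo "docker" ;;
        *)              echo "unknown"; return 1 ;;
    esac
}

# Example with a made-up version string:
default_runtime_for "containerd://1.5.8"

# Against a live cluster, print each node with its reported runtime.
# Skipped when kubectl is not installed on this machine.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' \
        || true
fi
```

The same information appears in the CONTAINER-RUNTIME column of kubectl get nodes -o wide.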

Is there a reason that RHEL 8 is not in GPU operator's roadmap? Since most of the deployments are moving to RHEL8/Rocky8, why would you not consider that as one of the supported platforms?

tusharrobin avatar Dec 09 '21 06:12 tusharrobin

FYI NVAIE says it supports rhel8.4 on 1.9.1 operator, huh? :)

yug0slav avatar Feb 24 '22 15:02 yug0slav

@tusharrobin we are looking into support for additional operating systems. Do I understand correctly that you use k8s with RHEL 8 and containerd?

@yug0slav Please note that NVIDIA AI Enterprise supports RHEL 8.4 to run containers but without k8s. NVIDIA AI Enterprise doesn't support RHEL worker nodes with the GPU Operator.

MrBoJo84 avatar Apr 15 '22 16:04 MrBoJo84

Yes, we need GPU operator support for RHEL 8 and Rocky 8.

tusharrobin avatar Apr 16 '22 06:04 tusharrobin

IBM Cloud Openshift also needs support for RHEL 8

relyt0925 avatar Jun 13 '22 01:06 relyt0925

@MrBoJo84 To add, IBM Cloud Openshift uses cri-o for our container runtime.

KodieGlosserIBM avatar Jun 17 '22 17:06 KodieGlosserIBM

Hi, most corporate environments are using RHEL 8.x, and will be using 9.x in the near future. We're currently struggling to install nvidia-driver in air-gapped environments, and gpu-operator would be the complete solution for us. I think RHEL would be a very useful and necessary addition to the support matrix.

snirkatriel avatar Jul 06 '22 13:07 snirkatriel

@snirkatriel It would help if you could share the exact stack that you are looking for support on. Is it with OpenShift or Kubernetes? Is it with containerd or crio? Which versions?

MrBoJo84 avatar Jul 06 '22 13:07 MrBoJo84

> @snirkatriel It would help if you could share the exact stack that you are looking for support on. Is it with OpenShift or Kubernetes? Is it with containerd or crio? Which versions?

Sure. We're using Kubernetes (k3s) with the containerd runtime, and we're looking at RHEL 8.3, 8.4, 8.6, and so on.

snirkatriel avatar Jul 06 '22 13:07 snirkatriel