gpu-operator
RHEL 8.x support for GPU Operator
I wanted to check whether RHEL 8.2 is supported by GPU Operator 1.9.0.
If it is not supported, in which version can we expect RHEL 8.2 support, and when?
@prpaul No, we don't support RHEL 8.x worker nodes, only CoreOS. There is no plan to support RHEL worker nodes in the short term.
@shivamerla So if there is no planned support or roadmap, what is the alternative to GPU operator in the field?
Most of the production deployments we have seen run RHEL 8, so what would you suggest as the way to deploy on Kubernetes?
@tusharrobin are you referring to RHEL worker nodes in OCP or using upstream K8s?
On OCP, the GPU Operator can still be used, but you need to build a private driver container from here: https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 and reference it while installing the GPU Operator. Alternatively, the driver can be installed directly on the RHEL nodes, passing driver.enabled=false when installing the GPU Operator.
With upstream K8s, apart from the driver itself, you need to make sure the ubi8 variants of the images are used for the GPU Operator components when installing with Helm:
helm install gpu-operator nvidia/gpu-operator --version=1.9.0 --set operator.defaultRuntime=crio,toolkit.version=1.7.2-ubi8,dcgmExporter.version=2.3.1-2.6.0-ubi8,dcgm.version=2.3.1-ubi8,migManager.version=v0.2.0-ubi8
Also add --set driver.enabled=false when the driver is pre-installed on each RHEL node.
However, this configuration is not officially qualified or supported by the GPU Operator.
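The flags above can be assembled in a small script, which makes it easier to switch the runtime or the driver setting per cluster. This is only a sketch: the function name build_gpu_operator_cmd is hypothetical, the image tags are copied from the command above, the chart version is written as v1.9.0 following the correction made later in this thread, and the script prints the command for review rather than running it.

```shell
# Sketch only: prints the helm command instead of executing it.
# Image tags are the ubi8 variants quoted in this thread; this setup is
# not officially qualified or supported by the GPU Operator.
build_gpu_operator_cmd() {
  runtime="$1"         # container runtime, e.g. crio or containerd
  driver_enabled="$2"  # "false" when the driver is pre-installed on each node

  # Build the comma-separated value list passed to a single --set flag.
  sets="operator.defaultRuntime=${runtime}"
  sets="${sets},toolkit.version=1.7.2-ubi8"
  sets="${sets},dcgmExporter.version=2.3.1-2.6.0-ubi8"
  sets="${sets},dcgm.version=2.3.1-ubi8"
  sets="${sets},migManager.version=v0.2.0-ubi8"
  sets="${sets},driver.enabled=${driver_enabled}"

  printf 'helm install gpu-operator nvidia/gpu-operator --version=%s --set %s\n' \
    "v1.9.0" "$sets"
}

# Example: cri-o runtime, driver already installed on the nodes.
build_gpu_operator_cmd crio false
```

Review the printed command before running it by hand.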
@shivamerla Even with
helm install gpu-operator nvidia/gpu-operator --version=1.9.0 --set operator.defaultRuntime=containerd,toolkit.version=1.7.2-ubi8,dcgmExporter.version=2.3.1-2.6.0-ubi8,dcgm.version=2.3.1-ubi8,migManager.version=v0.2.0-ubi8 --set driver.enabled=false
I am still seeing:
Warning FailedCreatePodSandBox 1s (x2 over 12s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:RuntimeHandler "nvidia" not supported
The runtime class is available, though:
[root@priyanko-bnp-mig1 gpu-operator]# kubectl get runtimeclass
NAME     HANDLER   AGE
nvidia   nvidia    71s
This is upstream Kubernetes.
@tusharrobin Can you show the status of all pods? The container toolkit pod has to be running for the nvidia runtime to be configured with containerd. Also, there was a typo in the version earlier: it should be v1.9.0 with helm install. Based on the command you mentioned, I am assuming the driver is pre-installed?
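One way to check whether the nvidia runtime has actually been configured on a node is to look for its entry in the containerd configuration. The sketch below assumes the default config path /etc/containerd/config.toml and the usual CRI plugin section name (containerd.runtimes.nvidia); has_nvidia_runtime is a hypothetical helper, not part of the GPU Operator.

```shell
# Sketch only: returns 0 if the given containerd config declares a runtime
# named "nvidia". The default path is an assumption and may differ per distro.
has_nvidia_runtime() {
  config="${1:-/etc/containerd/config.toml}"
  # -F matches the section suffix as a fixed string, e.g.
  # [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  [ -r "$config" ] && grep -qF 'containerd.runtimes.nvidia]' "$config"
}

if has_nvidia_runtime; then
  echo "containerd: nvidia runtime handler is configured"
else
  echo "containerd: nvidia runtime handler not found; check the container toolkit pod"
fi
```

Run this on the worker node itself. Pod status can be listed separately with kubectl get pods -A. The check only applies when containerd is the runtime in use.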
@shivamerla I was able to install using these options after removing the defaultRuntime option, as I was using Docker. Thanks for all your help!
Is there a reason that RHEL 8 is not in GPU operator's roadmap? Since most of the deployments are moving to RHEL8/Rocky8, why would you not consider that as one of the supported platforms?
FYI NVAIE says it supports rhel8.4 on 1.9.1 operator, huh? :)
@tusharrobin we are looking into support for additional operating systems. Do I understand correctly that you use k8s with RHEL 8 and containerd?
@yug0slav Please note that NVIDIA AI Enterprise supports RHEL 8.4 to run containers but without k8s. NVIDIA AI Enterprise doesn't support RHEL worker nodes with the GPU Operator.
Yes, we need GPU operator support for RHEL 8 and Rocky 8.
IBM Cloud Openshift also needs support for RHEL 8
@MrBoJo84 To add, IBM Cloud Openshift uses cri-o for our container runtime.
Hi, most corporations are using RHEL 8.x, and will move to 9.x in the near future. We're currently struggling to install nvidia-driver in air-gapped environments, and gpu-operator is the complete solution for us. I think adding this to the support matrix would be very useful and necessary.
@snirkatriel it would help if you could share the exact stack that you are looking support for. Is it with OpenShift or Kubernetes? Is it with containerd or crio? Which versions?
Sure. We're using Kubernetes (k3s) with the containerd runtime, and we're looking at RHEL 8.3, 8.4, 8.6 and so on.