gpu-operator
NVIDIA GPU-Driver support for RHEL8
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [no] Are you running on an Ubuntu 18.04 node?
- [no (Openshift 4.10+)] Are you running Kubernetes v1.13+?
- [X] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
2. Issue or feature description
Please add support for the RHEL8 operating system, starting with 8.6 and moving shortly to 8.7 and later releases. Currently only RHEL7 and RHEL7.9 are supported.
This support is needed to continue our GPU offering on OpenShift on IBM Cloud. Red Hat will be dropping support for RHEL7 on OpenShift version 4.10+ and moving to RHEL8. To continue this offering, we will need GPU driver support for RHEL8.
Thanks!
cc @shivamerla
potentially related to: https://github.com/NVIDIA/gpu-operator/issues/291
There were some earlier discussions with Michael C. H. where the plan was to use RHCOS. I started an email thread with Michael C. H. so that we can discuss this request.
Both will be active in our environments.
Hi, is there any progress with this?
@shivamerla @MrBoJo84
I believe Michael has emailed and confirmed the use case with us + you. Let us know if you need anything else. It would be wonderful if we can get a status on that piece.
@KodieGlosserIBM @relyt0925 We will publish driver images with RHEL8 tags during our September release.
Awesome! Thanks @shivamerla
I used the Dockerfile from https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 to build a driver image with the following command:
docker build -t hsc/driver:510.47.03-rhel8.4 --build-arg CUDA_VERSION=11.6.0 --build-arg TARGETARCH=x86_64 --build-arg DRIVER_VERSION=510.47.03 --no-cache .
The image can be built successfully.
Then I applied the driver image to my k3s cluster environment, but the nvidia-driver-daemonset pod cannot start. The error is below:
Normal   Created  16s (x2 over 17s)  kubelet  Created container nvidia-driver-ctr
Warning  Failed   16s (x2 over 17s)  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-driver": executable file not found in $PATH: unknown
Warning  BackOff  9s (x3 over 15s)   kubelet  Back-off restarting failed container
Have you encountered this issue before?
Thanks.
> @KodieGlosserIBM @relyt0925 We will publish driver images with RHEL8 tags during our September release.
When will driver images with RHEL8 tags be published?
Thanks.
@carlwang87 These should be available during week of 9/26.
@shivamerla I built a RHEL8 driver image with the code from https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 .
Then I deployed the driver image on a k3s cluster in an air-gapped environment. Pod logs:
I also found that the nvidia-driver script calls yum and dnf.
So I would like to know whether the driver image must be able to connect to the internet. Will this issue be fixed in the released version?
Thanks.
@shivamerla Now that RHEL8 has bumped to the 8.7 kernel the RHEL8.6 build will not work. Could we get a build available for the 8.7 kernel?
@KodieGlosserIBM The RHEL 8.6 image should work with 8.7 as well, since we dynamically determine the RHEL version and pull packages accordingly. Can you try to deploy using the image digest instead of the version tag?
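For readers following along, here is a minimal sketch of what "dynamically figure out the RHEL version" could look like. It is not taken from the driver container's source: the function name `rhel_releasever` is hypothetical, and reading `VERSION_ID` from an os-release file is an assumption about how such detection might be done.

```shell
# Hypothetical helper, not the driver container's actual code.
# Prints VERSION_ID (for example 8.7) from an os-release style file.
rhel_releasever() {
  (
    # os-release files are designed to be shell-sourceable; the subshell
    # keeps the sourced variables out of the caller's environment.
    unset VERSION_ID
    . "${1:-/etc/os-release}" || exit 1
    [ -n "${VERSION_ID:-}" ] || exit 1
    printf '%s\n' "$VERSION_ID"
  )
}
```

A value obtained this way could be passed to dnf's `--releasever` option, which is how a single image might serve more than one RHEL minor release.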
@shivamerla I was going to try re-uploading it with the 8.7 tag to test it out. I'm not sure how to use the image digest, though. The GPU operator automatically adds the _RHEL8.6/RHEL8.7 tags, and we have many instructions that rely on tags. Once I get my cluster up I'll check it out.
@KodieGlosserIBM If we use driver.version=sha256:<checksum>, the GPU operator doesn't append the OS version tag and uses the image with the digest that was specified.
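To make the tag-versus-digest behaviour concrete, here is a rough sketch of the rule as described in this thread. It is not the operator's actual code: the function name `driver_image_ref`, the `<version>-<os>` tag layout, and the sample repository and image names are assumptions for illustration only.

```shell
# Hypothetical sketch, not the GPU operator's implementation.
# Args: repository, image name, driver.version value, OS tag (e.g. rhel8.7).
driver_image_ref() {
  case "$3" in
    sha256:*)
      # A digest is used verbatim; no OS suffix is appended.
      printf '%s/%s@%s\n' "$1" "$2" "$3"
      ;;
    *)
      # A plain version gets the OS identifier appended (assumed layout).
      printf '%s/%s:%s-%s\n' "$1" "$2" "$3" "$4"
      ;;
  esac
}

driver_image_ref nvcr.io/nvidia driver 510.47.03 rhel8.7
driver_image_ref nvcr.io/nvidia driver "sha256:<checksum>" rhel8.7
```

With a digest, the node's OS version no longer affects which image is pulled, which is why an image built for 8.6 can be deployed on an 8.7 node this way.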
@shivamerla thank you. Using the sha does indeed work. I can work to get this documented, but for the sake of getting our customers to still be functional, can we tag the rhel8.6 images with rhel8.7 as well?
@shivamerla bump on the above please ^
@KodieGlosserIBM 8.7 tags are now published for the latest R450/470/510/515/525 drivers.
@shivamerla thank you sir!
@shivamerla Do you know when/if the RHEL 8.9 tags will be published? They seem to be missing from the driver page, and even if we switch to an 8.8 digest, the subscription manager fails to fetch the various repos:
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 293 B/s | 14 B 00:00
Errors during downloading metadata for repository 'rhel-8-for-x86_64-baseos-rpms':
- Status code: 404 for https://rhha02.updates.us-south.iaas.service.networklayer.com/pulp/repos/customer/Library/content/dist/rhel8/8.9/x86_64/baseos/os/repodata/repomd.xml (IP: 161.26.112.29)
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ dnf config-manager --set-disabled rhel-8-for-x86_64-baseos-eus-rpms
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.9 install kernel-headers-4.18.0-477.27.1.el8_8.x86_64 kernel-devel-4.18.0-477.27.1.el8_8.x86_64
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
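One thing visible in the log above is that dnf is invoked with `--releasever=8.9` while the requested kernel packages carry an `el8_8` marker. Whether that difference is related to the 404 is not established here. The helper below is only a diagnostic sketch for making such a comparison explicit: the name `kernel_rhel_release` is hypothetical, and reading `.el<major>_<minor>` as the RHEL minor release the kernel was built for is an assumption about the package naming.

```shell
# Hypothetical diagnostic helper, not part of the driver container.
# Prints "<major>.<minor>" taken from a ".el<major>_<minor>" marker in a
# kernel release string; returns non-zero when no such marker is present.
kernel_rhel_release() {
  krr_out="$(printf '%s\n' "$1" |
    sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\).*/\1.\2/p')"
  [ -n "$krr_out" ] || return 1
  printf '%s\n' "$krr_out"
}

kernel_rhel_release 4.18.0-477.27.1.el8_8.x86_64
```

Comparing this value with the `--releasever` argument shows at a glance whether the node kernel and the repository path being requested refer to the same minor release.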