NVIDIA GPU-Driver support for RHEL8

Open KodieGlosserIBM opened this issue 2 years ago • 12 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [no] Are you running on an Ubuntu 18.04 node?
  • [no (Openshift 4.10+)] Are you running Kubernetes v1.13+?
  • [X] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?

1. Issue or feature description

Requesting support for the RHEL8 operating system, starting with 8.6 and shortly moving to 8.7 and later releases. Currently only RHEL7 and RHEL7.9 are available.

This support is needed to continue our GPU offering on OpenShift on IBM Cloud. Red Hat will be dropping support for RHEL7 on OpenShift version 4.10+ and moving to RHEL8. To continue offering this, we will need GPU driver support for RHEL8.

Thanks!

KodieGlosserIBM avatar Jun 09 '22 14:06 KodieGlosserIBM

cc @shivamerla

relyt0925 avatar Jun 10 '22 02:06 relyt0925

potentially related to: https://github.com/NVIDIA/gpu-operator/issues/291

relyt0925 avatar Jun 13 '22 01:06 relyt0925

There were some earlier discussions with Michael C. H. where the plan was to use RHCOS. I started an email thread with Michael C. H. so that we can discuss this request.

MrBoJo84 avatar Jun 13 '22 21:06 MrBoJo84

Both will be active in our environments.

relyt0925 avatar Jun 15 '22 04:06 relyt0925

Hi, is there any progress with this?

snirkatriel avatar Jul 17 '22 12:07 snirkatriel

@shivamerla @MrBoJo84

I believe Michael has emailed and confirmed the use case with both us and you. Let us know if you need anything else. It would be wonderful if we could get a status update on that piece.

relyt0925 avatar Jul 21 '22 16:07 relyt0925

@KodieGlosserIBM @relyt0925 We will publish driver images with RHEL8 tags during our September release.

shivamerla avatar Aug 16 '22 23:08 shivamerla

Awesome! Thanks @shivamerla

KodieGlosserIBM avatar Aug 17 '22 02:08 KodieGlosserIBM

I used the Dockerfile from https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 to build a driver image with the following command:

  docker build -t hsc/driver:510.47.03-rhel8.4 --build-arg CUDA_VERSION=11.6.0 --build-arg TARGETARCH=x86_64 --build-arg DRIVER_VERSION=510.47.03 --no-cache .

The image builds successfully.

I then deployed the driver image to my k3s cluster, but the nvidia-driver-daemonset pod cannot start. The error is below:

Normal   Created  16s (x2 over 17s)  kubelet  Created container nvidia-driver-ctr
Warning  Failed   16s (x2 over 17s)  kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-driver": executable file not found in $PATH: unknown
Warning  BackOff  9s (x3 over 15s)   kubelet  Back-off restarting failed container

Have you encountered this issue before?

Thanks.

carlwang87 avatar Sep 02 '22 13:09 carlwang87

Quoting the earlier reply: "@KodieGlosserIBM @relyt0925 We will publish driver images with RHEL8 tags during our September release."

When will driver images with RHEL8 tags be published?

Thanks.

carlwang87 avatar Sep 05 '22 07:09 carlwang87

@carlwang87 These should be available during week of 9/26.

shivamerla avatar Sep 06 '22 15:09 shivamerla

@shivamerla I built a RHEL8 driver image with the code from https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 .

I then deployed the driver image on a k3s cluster in an air-gapped environment. The pod logs:

image

I also found yum and dnf calls in nvidia-driver.

So I would like to know whether the driver image must be connected to the internet. Will this be fixed in the release version?

Thanks.

carlwang87 avatar Sep 07 '22 03:09 carlwang87

@shivamerla Now that RHEL8 has moved to the 8.7 kernel, the RHEL8.6 build will not work. Could we get a build for the 8.7 kernel?

KodieGlosserIBM avatar Dec 06 '22 21:12 KodieGlosserIBM

@KodieGlosserIBM The RHEL 8.6 image should work with 8.7 as well, since we dynamically determine the RHEL version and pull packages accordingly. Can you try deploying with the image digest instead of the version tag?
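
As a rough illustration of what detecting the RHEL version at runtime could look like, here is a minimal Python sketch that reads the standard `ID` and `VERSION_ID` fields of an os-release file. This is an assumption for illustration only: the function names `parse_os_release` and `os_tag` and the sample file content are hypothetical, and the actual driver container may detect the version differently.

```python
def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def os_tag(fields):
    """Build an OS tag such as 'rhel8.7' from parsed os-release fields."""
    return f"{fields['ID']}{fields['VERSION_ID']}"


# Hypothetical sample; on a real node this text would come from /etc/os-release.
SAMPLE = '''NAME="Red Hat Enterprise Linux"
VERSION_ID="8.7"
ID="rhel"
'''

print(os_tag(parse_os_release(SAMPLE)))  # prints rhel8.7
```

If the version is derived this way at runtime, the same image can serve more than one minor release, which matches the statement that the 8.6 image should also work on 8.7.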

shivamerla avatar Dec 06 '22 22:12 shivamerla

@shivamerla I was going to try re-uploading it with the 8.7 tag to test it out. I'm not sure how to use the image digest. The GPU operator automatically adds the _RHEL8.6/RHEL8.7 tags, and we have many instructions that use the tag approach. Once I get my cluster up I'll check it out.

KodieGlosserIBM avatar Dec 06 '22 22:12 KodieGlosserIBM

@KodieGlosserIBM if we use driver.version=sha256:<checksum>, the GPU operator doesn't append the OS version tag and uses the image with the specified digest.
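
The tag-versus-digest behaviour can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the operator's actual code: `driver_image_ref`, the registry `registry.example.com/nvidia`, and the all-zero digest are hypothetical placeholders, and the `<version>-<os tag>` form is taken from the `510.47.03-rhel8.4` tag used earlier in this thread.

```python
def driver_image_ref(repository, image, version, os_tag):
    """Return an image reference for the driver container.

    If `version` is a digest (starts with 'sha256:'), reference the image
    by digest and do not append the OS tag. Otherwise build a tag of the
    form '<version>-<os_tag>'.
    """
    if version.startswith("sha256:"):
        return f"{repository}/{image}@{version}"
    return f"{repository}/{image}:{version}-{os_tag}"


# Hypothetical registry and placeholder digest, for illustration only.
REPO = "registry.example.com/nvidia"
PLACEHOLDER_DIGEST = "sha256:" + "0" * 64

print(driver_image_ref(REPO, "driver", "510.47.03", "rhel8.7"))
print(driver_image_ref(REPO, "driver", PLACEHOLDER_DIGEST, "rhel8.7"))
```

In the tag case the result depends on the node's OS version, so a missing `rhel8.7` tag makes the pull fail. In the digest case the OS version plays no part in the reference.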

shivamerla avatar Dec 06 '22 23:12 shivamerla

@shivamerla thank you. Using the sha does indeed work. I can work on getting this documented, but to keep our customers functional in the meantime, can we tag the rhel8.6 images with rhel8.7 as well?

KodieGlosserIBM avatar Dec 07 '22 19:12 KodieGlosserIBM

@shivamerla bump on the above please ^

KodieGlosserIBM avatar Dec 09 '22 17:12 KodieGlosserIBM

@KodieGlosserIBM 8.7 tags are published for the latest R450/470/510/515/525 drivers.

shivamerla avatar Dec 09 '22 17:12 shivamerla

@shivamerla thank you sir!

KodieGlosserIBM avatar Dec 09 '22 17:12 KodieGlosserIBM

@shivamerla do you know when/if the rhel 8.9 tags will be published? They seem to be missing from the driver page, and even if we switch to an 8.8 digest, the subscription manager fails to fetch the various repos:

Red Hat Enterprise Linux 8 for x86_64 - BaseOS  293  B/s |  14  B     00:00    
Errors during downloading metadata for repository 'rhel-8-for-x86_64-baseos-rpms':
  - Status code: 404 for https://rhha02.updates.us-south.iaas.service.networklayer.com/pulp/repos/customer/Library/content/dist/rhel8/8.9/x86_64/baseos/os/repodata/repomd.xml (IP: 161.26.112.29)
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ dnf config-manager --set-disabled rhel-8-for-x86_64-baseos-eus-rpms
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.9 install kernel-headers-4.18.0-477.27.1.el8_8.x86_64 kernel-devel-4.18.0-477.27.1.el8_8.x86_64
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
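
One detail in the log above is that dnf is invoked with `--releasever=8.9` while the kernel packages carry an `el8_8` suffix. The Python sketch below only makes that comparison explicit. The helper name `kernel_rhel_release` is hypothetical, and whether this mismatch has anything to do with the 404 is an open question, not a conclusion.

```python
import re


def kernel_rhel_release(kernel):
    """Extract 'MAJOR.MINOR' from a kernel string with an '.elMAJOR_MINOR.' part.

    Returns None when the kernel string carries no minor release suffix
    (for example a plain '.el8.' kernel).
    """
    match = re.search(r"\.el(\d+)_(\d+)\.", kernel)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


# Values copied from the log above.
kernel = "4.18.0-477.27.1.el8_8.x86_64"
releasever = "8.9"

derived = kernel_rhel_release(kernel)
print(f"kernel suggests {derived}, dnf was asked for {releasever}")
```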

dclain avatar Nov 27 '23 19:11 dclain