
NVIDIA GPU operator versions after 1.9.1 for RHEL7 are not working

Status: Open. KodieGlosserIBM opened this issue 3 years ago • 7 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
    • no, RHEL7
  • [x] Are you running Kubernetes v1.13+?
    • Openshift 4.9 which uses Kubernetes 1.22
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
    • yes, crio 1.22
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
    • ipmi_msghandler is loaded, but i2c_core is not; successful installs of version 1.9.1 also run without i2c_core:
      cat modules | grep ipmi_msghandler
      ipmi_msghandler 56728 1 ipmi_si, Live 0xffffffffc0ad1000
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces) cluster-policy.txt
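The module check from the list above can be run as a small snippet on a node (module names come from the checklist; the commented version checks assume node access and are a sketch, not an official procedure):

```shell
# Check whether the kernel modules from the checklist are loaded.
# grep exits nonzero when a module is absent, hence the fallback echo.
grep ipmi_msghandler /proc/modules || echo "ipmi_msghandler not loaded"
grep i2c_core /proc/modules || echo "i2c_core not loaded"
# crio --version      # expect >= 1.13
# kubectl version     # expect server v1.13+
echo "checklist done"
```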

1. Issue or feature description

We are installing the NVIDIA GPU Operator from the Red Hat OperatorHub. If we specify a version after 1.9.1, the install fails: the nvidia-operator-validator daemonset fails to create its container with Init:CreateContainerError. The full error is attached: container-error.txt

Note: We are using a driver image provided by NVIDIA, as found here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags (nvcr.io/nvidia/driver:450.80.02-rhel7.9). The image was last built on 11/04/2020 and uses a very old driver version.

I see a similar issue (although it appears to be Ubuntu-only): https://github.com/NVIDIA/nvidia-docker/issues/1447. I would expect the operator to work with this driver version out of the box.

2. Steps to reproduce the issue

Prereq: Openshift cluster with RHEL7 GPU nodes https://github.ibm.com/aivision/notebook/tree/master/gpu-operator#1-installation

3. Information to attach (optional if deemed irrelevant)

  • [x] kubernetes pods status: kubectl get pods --all-namespaces pods.txt
  • [x] kubernetes daemonset status: kubectl get ds --all-namespaces ds.txt
  • [x] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
    • see container-error in issue description
  • [x] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME
    • the pod never reaches a running state, so there are no logs
  • [x] Output of running a container on the GPU machine: docker run -it alpine echo foo
    • the driver seems to be installing fine; logs attached: driver-install.txt
  • [x] Docker configuration file: cat /etc/docker/daemon.json crio-conf.txt
  • [x] Docker runtime configuration: docker info | grep runtime crio-info.txt
  • [x] NVIDIA shared directory: ls -la /run/nvidia nvida-shared.txt
  • [x] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit nvida-packages.txt
  • [x] NVIDIA driver directory: ls -la /run/nvidia/driver nvida-driver.txt
  • [x] kubelet logs journalctl -u kubelet > kubelet.logs
    • tailed the last 1000 lines: kubelet.log
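The attachments above can be gathered in one pass with a small script. The commands mirror the checklist items; the kubectl/journalctl calls are left commented so the script can be reviewed before running against a live cluster, and the output file names are assumptions modeled on the attachments:

```shell
# Collect GPU operator diagnostics into one directory for attaching to an issue.
OUT=gpu-operator-debug
mkdir -p "$OUT"
# kubectl get pods --all-namespaces                 > "$OUT/pods.txt"
# kubectl get ds --all-namespaces                   > "$OUT/ds.txt"
# kubectl describe clusterpolicies --all-namespaces > "$OUT/cluster-policy.txt"
# ls -la /run/nvidia                                > "$OUT/nvidia-shared.txt"
# ls -la /usr/local/nvidia/toolkit                  > "$OUT/nvidia-packages.txt"
# ls -la /run/nvidia/driver                         > "$OUT/nvidia-driver.txt"
# journalctl -u kubelet | tail -n 1000              > "$OUT/kubelet.log"
echo "diagnostics directory: $OUT"
```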

KodieGlosserIBM avatar Jul 26 '22 21:07 KodieGlosserIBM

Attached ldconfig logs here ldconfig.log

hasan4791 avatar Jul 27 '22 14:07 hasan4791

@KodieGlosserIBM Can you overwrite container-toolkit version in ClusterPolicy with v1.6.0 (which seems to have worked with v1.9.1 installs) and confirm if this is resolved?

toolkit:
  repository: nvcr.io/nvidia/k8s
  version: 1.6.0-ubi8
  image: container-toolkit
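One way to apply this override without hand-editing the full ClusterPolicy is a merge patch. This is a sketch: the policy name "gpu-cluster-policy" is an assumption based on typical OperatorHub installs (check `oc get clusterpolicy` for yours), and the oc call is commented since it needs a live cluster:

```shell
# Merge patch mirroring the toolkit override above; adjust the policy name
# to match the ClusterPolicy instance on your cluster.
PATCH='{"spec":{"toolkit":{"repository":"nvcr.io/nvidia/k8s","image":"container-toolkit","version":"1.6.0-ubi8"}}}'
echo "$PATCH"
# oc patch clusterpolicy gpu-cluster-policy --type merge -p "$PATCH"
```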

shivamerla avatar Aug 01 '22 21:08 shivamerla

I'm able to install with the above change. I tried with operator v1.11.

prgavali avatar Aug 04 '22 13:08 prgavali

@shivamerla: @prgavali was able to confirm that this works. Is this something that will need to be added to newer versions?

KodieGlosserIBM avatar Aug 04 '22 18:08 KodieGlosserIBM

@KodieGlosserIBM For RHEL7 nodes, this workaround is required when the ClusterPolicy instance is created. For RHEL8, I believe the workaround is not required.

shivamerla avatar Aug 08 '22 18:08 shivamerla

@shivamerla can we get external docs for the workaround required for RHEL7?

Also, we cannot test on RHEL8 until we get driver support (this issue https://github.com/NVIDIA/gpu-operator/issues/358)

KodieGlosserIBM avatar Aug 16 '22 22:08 KodieGlosserIBM

@KodieGlosserIBM Since we don't officially claim support for RHEL7, documentation for this doesn't exist. I will check with our PMs on how to handle this case. By the way, you can re-tag the rhcos images as rhel8.5/rhel8.6 etc. and use private images for testing until we officially publish them. We plan to publish them during our next release in September.
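The re-tagging suggestion can be sketched with skopeo. The registry, destination tag, and driver version below are placeholders/assumptions, and the copy command is commented since it needs registry credentials:

```shell
# Re-tag a published rhcos driver image for RHEL 8.x testing, per the
# suggestion above. All names here are illustrative.
DRIVER_VERSION=510.47.03                                    # hypothetical version
SRC="nvcr.io/nvidia/driver:${DRIVER_VERSION}-rhcos4.9"      # published rhcos tag
DST="registry.example.com/nvidia/driver:${DRIVER_VERSION}-rhel8.5"  # private re-tag
echo "copy ${SRC} -> ${DST}"
# skopeo copy "docker://${SRC}" "docker://${DST}"
```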

shivamerla avatar Aug 16 '22 23:08 shivamerla