NVIDIA GPU Operator 24.3.0 Failed on OCP 4.14.23 Cluster

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Red Hat Enterprise Linux CoreOS/9.2
  • Kernel Version: 5.14.0-284.64.1.el9_2.x86_64
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O/1.27.5-2.rhaos4.14.gitbe29f54.el9
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP/414.92.202404231906-0
  • GPU Operator Version: 24.3.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

I installed an OCP 4.14.23 cluster where one worker node has a GPU, but the gpu-cluster-policy has failed and the GPU-related pods are not working.

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. Install OCP 4.14.23.
  2. Follow the instructions at IBM Docs to install the "Node Feature Discovery Operator" and the "Nvidia GPU Operator" from Nvidia's website.
  3. Install the Node Feature Discovery Operator: Nvidia Docs
    • Installed successfully.
  4. Install the Nvidia GPU Operator: Nvidia Docs
    • Failed. I followed the GPU (not vGPU) steps using the web console (not the CLI).
  5. Create the ClusterPolicy instance.
  6. Create the cluster policy using the web console:
    • By the end of this step, the state of the gpu-cluster-policy should be "State: ready", but it is "State: not ready", causing the GPU-related pods to fail (see the attached screenshot "gpu-pods.png" and the status-check sketch after this list).
    • ClusterPolicy logs indicate: "ClusterPolicy is not ready, states not ready: [state-driver state-container-toolkit state-operator-validation state-device-plugin state-dcgm state-dcgm-exporter gpu-feature-discovery]".
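
A quick way to confirm the failing state is to query the ClusterPolicy directly. A minimal sketch, assuming the policy is named gpu-cluster-policy as in this report and the operator is installed in the nvidia-gpu-operator namespace (adjust both if yours differ):

# Overall ClusterPolicy state (expected to report "ready")
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

# Conditions and events the operator records for the policy
oc describe clusterpolicy gpu-cluster-policy

# Operand pods managed by the operator
oc get pods -n nvidia-gpu-operator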

I contacted IBM Support, and they referred me to the logs at /var/log/nvidia-installer.log. These logs show failed driver installations. I uninstalled the "Node Feature Discovery Operator" and "Nvidia GPU Operator," then reinstalled them following the same steps and restarted the node. The drivers are now successfully installed. Attached are two nvidia-installer logs showing the state before and after the restart:

  • "nvidia-installer_AfterRestart.log"
  • "nvidia-installer_BeforeRestart.log"

4. Information to attach (optional if deemed irrelevant)

nvidia-installer_BeforeRestart.log nvidia-installer_AfterRestart.log

  • [ ] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • [ ] containerd logs journalctl -u containerd > containerd.log
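
On this OpenShift cluster the same checklist can be collected with oc; a rough sketch, assuming the nvidia-gpu-operator namespace (replace the placeholders), with CRI-O logs gathered in place of the containerd step since this cluster runs CRI-O:

oc get pods -n nvidia-gpu-operator
oc get ds -n nvidia-gpu-operator
oc describe pod -n nvidia-gpu-operator POD_NAME
oc logs -n nvidia-gpu-operator POD_NAME --all-containers
oc exec DRIVER_POD_NAME -n nvidia-gpu-operator -c nvidia-driver-ctr -- nvidia-smi
oc debug node/GPU_NODE_NAME -- chroot /host journalctl -u crio > crio.log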

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

habjouqa · May 16, 2024

I've just sent the must-gather bundle to the following email: [email protected]

habjouqa · May 16, 2024

Thanks @habjouqa, will take a look.

shivamerla · May 17, 2024

Hello @shivamerla, has there been any update on this?

AlessandroPomponio · June 6, 2024

There's a workaround that doesn't involve reinstalling the operator:

  1. Take a backup of your ClusterPolicy manifest
  2. Delete the ClusterPolicy
  3. Create the same ClusterPolicy again based on your backup

After a few minutes, the items listed under "states not ready" should resolve themselves.
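
A minimal sketch of that workaround with oc, assuming the policy is named gpu-cluster-policy as in this report:

# 1. Back up the ClusterPolicy manifest
#    (strip status and metadata fields such as resourceVersion/uid before re-applying)
oc get clusterpolicy gpu-cluster-policy -o yaml > clusterpolicy-backup.yaml

# 2. Delete the ClusterPolicy
oc delete clusterpolicy gpu-cluster-policy

# 3. Recreate it from the backup
oc apply -f clusterpolicy-backup.yaml

# Watch the state until it reports "ready"
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'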

psy-q · July 11, 2024

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] · November 5, 2025

This issue has been open for over 90 days without recent updates, and the context may now be outdated.

Given that gpu-operator 24.3.0 is EOL now, I would encourage you to try the latest version and see if you still see this issue.

If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.

rahulait · November 14, 2025