
NVIDIA GPU Operator issue on OpenShift (4.17.20)

Open Nikhil-VW opened this issue 8 months ago • 7 comments

Hello Team,

Recently, while upgrading the OpenShift cluster from version 4.17.18 to 4.17.20 (Kubernetes v1.30.10), the NVIDIA GPU Operator was upgraded to version v25.3. After the upgrade, the GPU Operator was down, which impacted the applications using the GPU. All of the operator's pods (dcgm, validator, dcgm-exporter) were stuck in the "Init" stage, and the GPU ClusterPolicy was in a "not ready" state. We later rolled back to the previous version, v24.9.2, which resolved the issue.
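
For context, the stuck state could be confirmed with commands along these lines (a rough sketch only; it assumes the operator is installed in the default nvidia-gpu-operator namespace and that the ClusterPolicy is named gpu-cluster-policy, which may differ in a given install):

# List the operator's pods and their current phase
oc get pods -n nvidia-gpu-operator -o wide

# Check the overall state reported in the ClusterPolicy status
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

# Inspect why a particular pod is stuck in Init (replace the pod name)
oc describe pod <pod-name> -n nvidia-gpu-operator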

Could the new version, v25.3, be unsupported on the latest OpenShift and Kubernetes versions, or does it contain a bug that caused the issue? Could someone investigate this and let us know the root cause? We need to document the reason and provide it to the business department as part of the post-mortem analysis.

Let me know if you require any details.



Thanks,

Nikhil-VW avatar Mar 28 '25 11:03 Nikhil-VW

Would you be able to describe which component was failing and provide logs from it? Alternatively, you can provide the full debug bundle by running:

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

This bundle can be submitted to us via email: [email protected]

cdesiniotis avatar Mar 28 '25 18:03 cdesiniotis

Hello Christopher,

I have sent you the email with the attachment. Please check and let us know if you require any other details.

Nikhil-VW avatar Apr 01 '25 06:04 Nikhil-VW

It appears the must-gather logs you collected were for the working v24.9.2 installation. Would you be able to capture the must-gather logs from the failing v25.3.0 install?

cdesiniotis avatar Apr 01 '25 06:04 cdesiniotis

cc @empovit

cdesiniotis avatar Apr 01 '25 06:04 cdesiniotis

Hi @empovit @cdesiniotis, we had an issue with the newer version of the GPU Operator, v25.3.0. As discussed, we were facing problems with that version. We deployed a sample GPU pod and it remained in the Pending state. As requested, we ran the GPU debug (must-gather) script:

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

The output logs from that run are attached below; hopefully they help in resolving the issue.
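
A minimal sketch of a sample GPU test pod of this kind is shown below (the pod name and image tag here are illustrative assumptions, not the exact manifest used):

# Create a minimal test pod that requests one GPU (illustrative manifest;
# the image tag and names are assumptions)
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# A pod stuck in Pending typically shows an unschedulable event, e.g. no
# node advertising an allocatable nvidia.com/gpu resource
oc describe pod cuda-vectoradd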

gpu-logs.zip

kaifrazaVWITS avatar Apr 22 '25 09:04 kaifrazaVWITS

@kaifrazaVWITS according to the logs, you're running on g3s.xlarge instances, which have NVIDIA Tesla M60 GPUs. Unfortunately, this GPU model does not appear to be supported by the GPU Operator, according to https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.0/platform-support.html#supported-nvidia-data-center-gpus-and-systems, although the driver itself is supposed to support it: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-124-06/index.html#hardware-software-support.
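
For anyone reproducing this check, a rough sketch of how the GPU model can be confirmed on the node (the node name below is a placeholder):

# Identify the GPU on a worker node via a debug pod
oc debug node/<gpu-node-name> -- chroot /host lspci -nn | grep -i nvidia

# If GPU Feature Discovery has already labeled the nodes, the product label
# also shows the model
oc get nodes -L nvidia.com/gpu.product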

@cdesiniotis do you have any insights?

empovit avatar Apr 23 '25 06:04 empovit

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]