gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Results 392 gpu-operator issues

**Describe the Ask** Many GPUs in the wild are getting "py-torched" by sustained high heat and die fast. The GPU Operator should offer a power limiting/regulating feature to avoid over-heat &...

feature

**Describe the bug** When upgrading from GPU Operator 25.3.4 to 25.10.0, having the GDS option enabled in both versions, the nvidia-fs-ctr container fails to build the module due to `gcc-12`...

bug
needs-triage

**Describe the bug** After upgrading to 25.10.0 from 25.3.2, I get error: CrashLoopBackOff (back-off 40s restarting failed container=hello-kubernetes pod=hello-kubernetes-69575f56b-9dzz4_test-ns(5e20c659-44d1-4c22-9d8f-560e3411fc58)) | Last state: Terminated with 128: StartError (failed to create containerd...

needs-triage

_**Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case [here](https://enterprise-support.nvidia.com/s/create-case)**._ **Describe the bug** My host only has one A100 card. When installing...

more-information-needed

**Describe the bug** - nvidia-container-toolkit-daemonset continuously printing "failed to validate the driver,...

more-information-needed

**Describe the bug** We are trying to install the operator on OKD,...

feature

Hi team, we found the above security vulnerability in the glibc package in the two containers below: **nvcr.io/nvidia/gpu-operator** and **nvcr.io/nvidia/cloud-native/gpu-operator-validator**. Please update the package to the latest patched version. Affected Versions: 2.13 to 2.40 Fixed...

more-information-needed

## Environment Check Testing is done on a bare-metal, single-master, kubeadm-bootstrapped cluster. - Ubuntu 22.04.4 LTS - VFIO is set up properly on the host - Driver installed Kubernetes...

question

I have a few Grace Hopper 200s that I am trying to cluster up using k8s. On the host I have the 560 drivers running from the repos: ``` cat...

more-information-needed

On the page for Precompiled Driver Containers, https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#limitations-and-restrictions, the following limitations and restrictions are described.

> Limitations and Restrictions
>
> Support for deploying the driver containers with precompiled drivers is limited...

question