gpu-operator
NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Bumps [github.com/regclient/regclient](https://github.com/regclient/regclient) from 0.9.2 to 0.11.1. Release notes, sourced from github.com/regclient/regclient's releases: v0.11.1 includes Security: Go 1.25.5 fixes CVE-2025-61729 (PR 1025); Go 1.25.5 fixes CVE-2025-61727 (PR 1025). Fixes: Correct...
Bumps golang from 1.25.4 to 1.25.5. [Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a...
Hello, NVIDIA Team. I'm facing an issue while configuring `dcgm-exporter` from `gpu-operator`. I have two Kubernetes clusters: one is a cluster where GPU jobs run, and the other is...
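For context, `dcgm-exporter` is typically tuned through the GPU Operator's Helm values rather than directly. Below is a minimal, hedged sketch of a per-cluster values override; the `dcgmExporter.serviceMonitor` and `dcgmExporter.config` keys and the ConfigMap name are assumptions based on common chart layouts, so verify them against your chart version's `values.yaml`.

```yaml
# values-override.yaml -- hedged sketch of per-cluster dcgm-exporter settings
# applied with: helm upgrade gpu-operator nvidia/gpu-operator -f values-override.yaml
# NOTE: key names below are assumptions; check your chart's values.yaml.
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true              # let this cluster's Prometheus Operator scrape the exporter
    interval: 15s
  config:
    name: custom-dcgm-metrics  # hypothetical ConfigMap holding a custom metrics CSV
  env:
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/custom-metrics.csv  # path to the metrics list inside the container
```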
Implement automated forward-compatibility tests that validate the GPU Operator against the latest published images from NVIDIA component repositories. Changes (a minimal trigger sketch follows this list):
- Add forward-compatibility.yaml workflow (weekly + manual trigger)
- Create get-latest-images.sh...
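For reference, a "weekly + manual trigger" GitHub Actions workflow is normally expressed with a `schedule` cron plus `workflow_dispatch`. The sketch below shows only that trigger shape; the job contents, script path, and cron time are assumptions, not the PR's actual workflow.

```yaml
# forward-compatibility.yaml -- minimal trigger sketch, not the actual workflow from this PR
name: forward-compatibility
on:
  schedule:
    - cron: "0 6 * * 1"    # weekly, Mondays 06:00 UTC (illustrative choice)
  workflow_dispatch: {}     # allow manual runs from the Actions tab
jobs:
  resolve-images:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Resolve latest component images
        run: ./hack/get-latest-images.sh   # script location is an assumption
```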
**Describe the bug**
https://github.com/k0sproject/k0s/issues/6547 The two-step import you introduced, `/etc/k0s/containerd.d/nvidia.toml -> /etc/containerd/conf.d/99-nvidia.toml`, breaks k0s clusters.
**To Reproduce**
Use gpu-operator on a k0s cluster.
**Expected behavior**
Don't be too...
This commit adds a proper wait in the GPU driver for the MOFED driver to be ready, so that RDMA APIs are available when the driver is recompiled. This ensures the operator supports cluster...
### Title: chore(docker): optimize Dockerfile and reduce image size
### Description:
This PR improves the NVIDIA GPU Operator Dockerfile by:
* Reducing the image size by cleaning DNF caches and...
Driver init fails in air-gapped clusters due to hard-coded mount of Red Hat subscription repo config
### Summary
When deploying the GPU Operator in an **air-gapped** (offline) cluster, the `nvidia-driver-daemonset` init container fails to start. Root cause: the driver image ships with a **public YUM repo** enabled...
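As a possible mitigation (not confirmed by this issue), the GPU Operator can be pointed at a custom repository configuration supplied via a ConfigMap, which air-gapped setups commonly use to replace public repos with an internal mirror. A hedged Helm values sketch follows; the `driver.repoConfig.configMapName` key, namespace, and ConfigMap name are assumptions to verify against your operator version's documentation.

```yaml
# values-airgap.yaml -- hedged sketch for pointing the driver container at a local mirror
# Create the ConfigMap first, e.g.:
#   kubectl create configmap repo-config -n gpu-operator --from-file=./local-mirror.repo
driver:
  enabled: true
  repoConfig:
    configMapName: repo-config   # ConfigMap containing .repo files for the internal mirror (assumed key)
```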
An operator error occurs roughly once a day on our H100 node, where time-slicing is enabled on the `mig-1g.10gb` MIG instances. This causes the other pods to restart, as seen below...
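For context, time-slicing over MIG-backed resources is configured through a device-plugin sharing config held in a ConfigMap and referenced by the ClusterPolicy/Helm values. The sketch below mirrors the `mig-1g.10gb` resource from the report, but the ConfigMap name, data key, and replica count are assumptions, not the reporter's actual configuration.

```yaml
# Hedged sketch of a time-slicing config for MIG-backed resources
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  mig-1g.10gb: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/mig-1g.10gb   # MIG profile being shared, as in the report
            replicas: 4                    # illustrative replica count
```

The ConfigMap would then be referenced from the device-plugin section of the Helm values (for example a `devicePlugin.config.name` style key; verify the exact key for your operator version).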