gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Results 392 gpu-operator issues

Bumps [github.com/regclient/regclient](https://github.com/regclient/regclient) from 0.9.2 to 0.11.1. Release notes Sourced from github.com/regclient/regclient's releases. v0.11.1 Release v0.11.1 Security: Go 1.25.5 fixes CVE-2025-61729 (PR 1025) Go 1.25.5 fixes CVE-2025-61727 (PR 1025) Fixes: Correct...

dependencies

Bumps golang from 1.25.4 to 1.25.5. [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=golang&package-manager=docker&previous-version=1.25.4&new-version=1.25.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a...

dependencies
docker

Hello, NVIDIA Team. I'm facing an issue while configuring `dcgm-exporter` from `gpu-operator`. I have 2 Kubernetes clusters - one is a cluster where GPU jobs run, and the other is...

good-first-issue

Implement automated forward compatibility tests that validate GPU Operator against the latest published images from NVIDIA component repositories. Changes: - Add forward-compatibility.yaml workflow (weekly + manual trigger) - Create get-latest-images.sh...

**Describe the bug** https://github.com/k0sproject/k0s/issues/6547 The two step import you introduced, `/etc/k0s/containerd.d/nvidia.toml -> /etc/containerd/conf.d/99-nvidia.toml`, breaks k0s clusters. **To Reproduce** Use gpu-operator on a k0s cluster. **Expected behavior** Don't be too...

bug
needs-triage

This commit makes the GPU driver wait properly for the MOFED driver to be ready, so RDMA APIs are available when the driver is recompiled. This ensures the operator supports cluster...

### Title: chore(docker): optimize Dockerfile and reduce image size ### Description: This PR improves the NVIDIA GPU Operator Dockerfile by: * Reducing the image size by cleaning DNF caches and...

### Summary When deploying GPU Operator in an **air-gapped** (offline) cluster the `nvidia-driver-daemonset` init container fails to start. Root cause: the driver image ships with a **public YUM repo** enabled...

bug

An operator error occurs roughly once a day on our H100, where time-slicing is enabled on the `mig-1g.10gb` instances. This causes the other pods to restart as seen below...

bug