gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Results: 392 gpu-operator issues

Main issue: not able to use GPU inside minikube due to permission issues. ### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): ``` >> uname -a Linux xxx 6.5.0-25-generic #25~22.04.1-Ubuntu...

How can I configure the GPU Operator so that it automatically mounts the `nvidia-settings` `nvidia-xconfig` binaries? The following binaries are automatically mounted from the host `/run/nvidia/driver/usr/bin/*` to the Pod by...

There are two GDS-related commits to fix the following issue. However, the two commits have not been merged (cherry-picked) from master to the v23.9 branch. I have tested the v23.9.2 gpu-operator, repoConfig is...

### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): AlmaLinux 9.3 * Kernel Version: 5.14.0-362.18.1.el9_3.x86_64 * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O v1.24.0 * K8s Flavor/Version(e.g. K8s, OCP, Rancher,...

Hi Team, I've deployed an NVIDIA GPU Operator in Azure AKS 1.27.7. The pods are up/running. However, I see critical vulnerabilities in "gpu-operator" image. Please have a look at it....

### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04 for EKS (ARM) / ami-09b6385a90c8d3cee * Kernel Version: 5.15.0-1041-aws * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd 1.7.2 *...

### 1. Quick Debug Information * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O * K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Openshift v4.13.23 * GPU Operator Version: 23.9.1 , gpu-operator-certified.v1.11.1...

### Feature Description AFAIUC, `validator` app currently only validates the driver installation. It would be great to have additional validating steps for the DCGM installation. It can be enabled with...

feature

### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04 * Kernel Version: 6.2 * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd * K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE,...

needs-triage

This commit allows someone not to set a cluster-wide policy for GPUs; the default behaviour then applies. Only if a node has a selector for a particular devicePlugin config (with...