
NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Results 303 gpu-operator issues

I have an issue with the nvidia-gpu-operator where, after setting limits for "nvidia.com/gpu: 1", my container gets scheduled onto a GPU that is already allocated to another container. Additionally, I previously had troubles...
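For context on the report above, here is a minimal sketch of the kind of pod spec involved, assuming a standard device-plugin setup; the pod name and sample image tag are illustrative assumptions, not taken from the report:

```sh
# Minimal test pod requesting one GPU via the device plugin's extended resource.
# Pod name and image tag are illustrative; adjust to your environment.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

With a limit like this, each physical GPU is normally advertised once and allocated exclusively, so two containers sharing a GPU usually points at the device plugin's view of the node (or an enabled sharing mode such as time-slicing or MIG) rather than at the pod spec itself.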

### 1. Quick Debug Information * Bottlerocket / AML 2023 * containerd * EKS ### 2. Feature description I am interested in understanding whether the NVIDIA operator is compatible...

### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): **Ubuntu 20.04.4 LTS** * Kernel Version: **5.4.0-113-generic** * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): **containerd://1.5.8** * K8s Flavor/Version(e.g. K8s, OCP, Rancher,...

Hi, I'm trying to use the GPU Operator with vGPU support following [this article](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html) on [k3s](https://k3s.io/). **After I install the operator, the vGPU pods get stuck in the Init state, and then the...

The example fails (as also reported in issue https://github.com/NVIDIA/gpu-operator/issues/415): ``` Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime...
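That error typically means the sample image's CUDA runtime is newer than the driver the operator installed on the node. One way to compare is to read the driver/CUDA versions from the driver pod; the daemonset name and namespace below assume a default gpu-operator installation:

```sh
# Print the driver and CUDA versions reported on the node.
# Daemonset name and namespace assume a default install; adjust to your cluster.
kubectl -n gpu-operator exec -it daemonset/nvidia-driver-daemonset -- nvidia-smi
```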

Platform: OpenShift 4.12 Version of the GPU Operator: 22.9.2 GPU: Tesla T4 Problem: After the creation of the ClusterPolicy, the driver DaemonSet pod enters CrashLoopBackOff with the following logs: (also complete...

The RKE2 docs only mention passing the config for RKE2's internal CONTAINERD_SOCKET: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator NVIDIA's docs also mention CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2 Following the gpu-operator documentation, the following will happen: - gpu-operator will write containerd...
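For reference, the docs linked above pass both settings as container-toolkit environment variables at install time. A sketch of that Helm invocation follows; the RKE2 paths are taken from those docs and should be verified against your RKE2 version:

```sh
# Point the NVIDIA container toolkit at RKE2's bundled containerd.
# Paths come from the linked RKE2/NVIDIA docs; verify them on your nodes first.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
  --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
  --set 'toolkit.env[2].value=nvidia'
```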

This PR addresses issue [710](https://github.com/NVIDIA/gpu-operator/issues/710) by allowing custom labels to be passed for the gpu-operator's ServiceMonitor.

I have deployed the GPU Operator on an EKS cluster using the Helm chart, and I have also deployed Prometheus, but when I query the GPU utilization metric DCGM_FI_DEV_GPU_UTIL it returns empty even...
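One way to narrow this down is to check whether the DCGM exporter itself reports the metric before looking at the Prometheus scrape path; the namespace, service name, and port below assume a default gpu-operator installation:

```sh
# Query the DCGM exporter directly, bypassing Prometheus.
# Namespace, service name, and port assume gpu-operator defaults; adjust as needed.
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```

If the metric shows up here but not in Prometheus, the gap is usually in the ServiceMonitor or scrape configuration rather than in DCGM itself.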