
NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Results 303 gpu-operator issues

I have an issue with the nvidia-gpu-operator where, after setting limits for "nvidia.com/gpu: 1", my container gets scheduled onto a GPU that is already allocated to another container. Additionally, I previously had troubles...
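For context on the report above, here is a minimal sketch of the kind of pod spec involved, assuming a standard device-plugin setup; the pod name and sample image tag are illustrative assumptions, not taken from the report:

```sh
# Minimal test pod requesting one GPU via the device plugin's extended resource.
# Pod name and image tag are illustrative; adjust to your environment.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

With a limit like this, each physical GPU is normally advertised once and allocated exclusively, so two containers sharing a GPU usually points at the device plugin's view of the node (or an enabled sharing mode such as time-slicing or MIG) rather than at the pod spec itself.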

### 1. Quick Debug Information * Bottlerocket / AML 2023 * containerd * EKS ### 2. Feature description I am interested in understanding whether the NVIDIA operator is compatible...

### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): **Ubuntu 20.04.4 LTS** * Kernel Version: **5.4.0-113-generic** * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): **containerd://1.5.8** * K8s Flavor/Version(e.g. K8s, OCP, Rancher,...

Hi, I'm trying to use the GPU Operator with vGPU support following [this article](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-kubevirt.html) on [k3s](https://k3s.io/). **After I install the operator, the vGPU pods get stuck in the Init state, and then the...

The example fails (as also reported in issue https://github.com/NVIDIA/gpu-operator/issues/415): ``` Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime...
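That error typically means the sample image's CUDA runtime is newer than the driver the operator installed on the node. One way to compare is to read the driver/CUDA versions from the driver pod; the daemonset name and namespace below assume a default gpu-operator installation:

```sh
# Print the driver and CUDA versions reported on the node.
# Daemonset name and namespace assume a default install; adjust to your cluster.
kubectl -n gpu-operator exec -it daemonset/nvidia-driver-daemonset -- nvidia-smi
```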

Platform: OpenShift 4.12 Version of the GPU Operator: 22.9.2 GPU: Tesla T4 Problem: After the creation of the ClusterPolicy, the driver DaemonSet pod enters CrashLoopBackOff with the following logs: (also complete...

The RKE2 docs only mention passing the config for RKE2's internal CONTAINERD_SOCKET: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator NVIDIA's docs also mention CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2 Following the gpu-operator documentation, the following will happen: - gpu-operator will write containerd...
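For reference, the docs linked above pass both settings as container-toolkit environment variables at install time. A sketch of that Helm invocation follows; the RKE2 paths are taken from those docs and should be verified against your RKE2 version:

```sh
# Point the NVIDIA container toolkit at RKE2's bundled containerd.
# Paths come from the linked RKE2/NVIDIA docs; verify them on your nodes first.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
  --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
  --set 'toolkit.env[2].value=nvidia'
```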

This PR addresses issue [710](https://github.com/NVIDIA/gpu-operator/issues/710) by allowing custom labels to be passed for the gpu-operator's ServiceMonitor.

I have deployed the GPU Operator on an EKS cluster using the Helm chart, and I have also deployed Prometheus, but when I query the GPU utilization metric DCGM_FI_DEV_GPU_UTIL it returns empty even...
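One way to narrow this down is to check whether the DCGM exporter itself reports the metric before looking at the Prometheus scrape path; the namespace, service name, and port below assume a default gpu-operator installation:

```sh
# Query the DCGM exporter directly, bypassing Prometheus.
# Namespace, service name, and port assume gpu-operator defaults; adjust as needed.
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```

If the metric shows up here but not in Prometheus, the gap is usually in the ServiceMonitor or scrape configuration rather than in DCGM itself.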