gpu-operator issues

nvidia-driver-daemonset always fails on Ubuntu 20.04.2

4

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._...

aipredict

Error when trying to use operator on DGX A100-80GB with microk8s and mixed strategy MIG

19

### 1. Issue or feature description

On a DGX A100-80GB, trying to install the operator with mixed strategy MIG, feature discovery/node labeling work fine with MIG disabled, but as soon...

reuben

WSL2 Support

10

Hi, I wonder if it's possible to use the gpu-operator in a single-node MicroK8s cluster hosted on a WSL2 Ubuntu distribution. Thanks.

mchikyt3

nvidia-container-runtime prevents pods from terminating

4

It seems that /usr/local/nvidia/toolkit/nvidia-container-runtime may fail if it is run from a directory that no longer exists. I can see the following in the kubelet.log ``` E0201...

xhejtman

repoConfig to override /etc/apt/sources.list is not working

6

k8s: 1.18.10, self-hosted, **w/o Internet access**; workers: Ubuntu 18.04.4. The `nvidia-driver-daemonset` pod fails during the package update (because of the private cluster): ``` Checking NVIDIA driver packages... Updating the package...

withoutnickname

K8s cluster with two gpu nodes with centos 7, centos 8

3

I have a 6-node Kubernetes cluster with GPU operator 1.9 installed. I have 2 GPU servers (EC2 types p2 and p3) on AWS. I have installed CentOS 7...

sricharanrobinsystems

Question: is it possible to allow deployment when nvidia-smi returns a non-zero exit code?

14

Hello, I'm facing some issues trying to make a GPU available in a kubernetes cluster. Based on my investigations, deployment process stops at nvidia-driver-daemonsets being blocked because the driver-validator which...

ymazzer

Pods take 25-30 minutes to terminate

2

### 1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? [No -- CentOS Linux release 7.6.1810 (Core)]
- [X] Are you running Kubernetes v1.13+?...

dbugit

Cannot find nvidia-smi in $PATH in toolkit-validation

13

The following installation will fail with "Cannot find nvidia-smi in $PATH" ``` helm install -n gpu-operator gpu-operator nvidia/gpu-operator --version=v1.7.1 --set driver.version=460.32.03 --set toolkit.version=1.5.0-ubuntu18.04 --set operator.defaultRuntime=containerd --set toolkit.env[0].name=CONTAINERD_CONFIG --set toolkit.env[0].value=/etc/containerd/config.toml --set...

mastier

OS support for GPU operator

1

I have the GPU operator running successfully in a k8s cluster on CentOS 7. Given that CentOS 7 will reach EOL in about 2 years and CentOS 8 is already EOL, what...

cocampbe
