gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Results 392 gpu-operator issues

I’m planning to upgrade the nvidia-driver version node by node so that some pods stay running during the update. Is there any way to upgrade drivers node by node?

enhancement

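Product: we currently use NVIDIA Tesla T4 GPUs. Problem: with nvidia-smi we can see the total GPU memory and the GPU memory used by a given process; [https://files.51wyq.cn/tmp/image001.png](url) With the GPU exporter we can also monitor the dynamic memory usage of each GPU, [https://files.51wyq.cn/tmp/image002.png](url) Because multiple processes share a single GPU, we cannot track the dynamic GPU memory usage of each process. Is there a tool that can directly monitor per-process GPU memory usage? We would also like to plot a process's GPU memory usage as a graph. Is there a ready-made tool (we currently use Kubernetes)? Thanks! [https://files.51wyq.cn/tmp/image003.png](url)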

Question description: In the directory deployments/gpu-operator/templates/, operator.yaml contains the WATCH_NAMESPACE env, but in main.go the ctrl.Options struct does not contain a Namespace field, so gpu-operator does not support namespace-level scope. Personal ideas: In...

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._...

Hi, we need to deploy the operator on SLES15-SP3 hosts. It seems the operator detects the host OS and wants to pull nvcr.io/nvidia/driver:470.82.01-sles15.3, but this image does not exist. https://gitlab.com/nvidia/container-images/driver https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.6.0 https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags...

### 1. Quick Debug Checklist
- [x] Are you running Kubernetes v1.13+?
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)

### 1. Issue or feature description
node-feature-discovery...

### 1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [ ] Are you running Kubernetes v1.13+?
- [ ] Are you running...

OS version: Rocky Linux 8.6

nvidia-fabricmanager failed to start:

```
[root@test-rocky8-kvm63 ~]# systemctl start nvidia-fabricmanager
Job for nvidia-fabricmanager.service failed because the control process exited with error code. See "systemctl status...
```