gpushare-scheduler-extender
Creating a pod fails with nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run
My cluster is running Kubernetes 1.16. After completing the installation according to the tutorial, the node reports aliyun.com/gpu-count: 2 and gpu-mem: 22. When I create a pod requesting aliyun.com/gpu-mem: 10, it fails with stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run; if I also specify gpu-count: 1, the pod fails with Back-off restarting failed container.
What could be the cause?
Environment: docker 19.03.5
NVIDIA driver 410.48; NVIDIA Corporation GP102 [TITAN X]; nvidia-docker2-2.2.2; CentOS 7.6; go 1.13.5
You should take a look at the gpushare-device-plugin logs. I suspect gpushare-scheduler-extender is not configured correctly.
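A minimal sketch of how to pull those logs, assuming the plugin runs as a DaemonSet in kube-system with "gpushare" in its pod name (names may differ in your deployment; the pod names below are placeholders):
# Find the device-plugin pod on the affected node
kubectl get pods -n kube-system -o wide | grep gpushare
# Dump its log (replace the placeholder with the pod name found above)
kubectl logs -n kube-system gpushare-device-plugin-ds-xxxxx
# The scheduler extender log can be checked the same way
kubectl logs -n kube-system gpushare-schd-extender-xxxxxxxx-xxxxx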
I have the same issue. It shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 28s default-scheduler Successfully assigned kong/binpack-2-54df84c8d7-nknfx to 192.168.3.4
Normal Pulled <invalid> (x3 over <invalid>) kubelet, 192.168.3.4 Container image "cheyang/gpu-player:v2" already present on machine
Normal Created <invalid> (x3 over <invalid>) kubelet, 192.168.3.4 Created container binpack-2
Warning Failed <invalid> (x3 over <invalid>) kubelet, 192.168.3.4 Error: failed to start container "binpack-2": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-2MiB-to-run\\\\n\\\"\"": unknown
I also checked gpushare-device-plugin. The log is here:
I0220 10:16:27.337772 1 main.go:18] Start gpushare device plugin
I0220 10:16:27.337870 1 gpumanager.go:28] Loading NVML
I0220 10:16:27.365052 1 gpumanager.go:37] Fetching devices.
I0220 10:16:27.365082 1 gpumanager.go:43] Starting FS watcher.
I0220 10:16:27.365204 1 gpumanager.go:51] Starting OS watcher.
I0220 10:16:27.392614 1 nvidia.go:64] Deivce GPU-5f44aae0-ca45-9038-5202-a033fa4f471a's Path is /dev/nvidia0
I0220 10:16:27.392682 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.392691 1 nvidia.go:40] set gpu memory: 11
I0220 10:16:27.392699 1 nvidia.go:76] # Add first device ID: GPU-5f44aae0-ca45-9038-5202-a033fa4f471a-_-0
I0220 10:16:27.392713 1 nvidia.go:79] # Add last device ID: GPU-5f44aae0-ca45-9038-5202-a033fa4f471a-_-10
I0220 10:16:27.421574 1 nvidia.go:64] Deivce GPU-28071eed-0993-c165-e123-ea818a546f14's Path is /dev/nvidia1
I0220 10:16:27.421596 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.421604 1 nvidia.go:76] # Add first device ID: GPU-28071eed-0993-c165-e123-ea818a546f14-_-0
I0220 10:16:27.421628 1 nvidia.go:79] # Add last device ID: GPU-28071eed-0993-c165-e123-ea818a546f14-_-10
I0220 10:16:27.453463 1 nvidia.go:64] Deivce GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c's Path is /dev/nvidia2
I0220 10:16:27.453482 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.453490 1 nvidia.go:76] # Add first device ID: GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c-_-0
I0220 10:16:27.453502 1 nvidia.go:79] # Add last device ID: GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c-_-10
I0220 10:16:27.480145 1 nvidia.go:64] Deivce GPU-d0cd36b5-9221-facd-203c-b2342b207439's Path is /dev/nvidia3
I0220 10:16:27.480166 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.480172 1 nvidia.go:76] # Add first device ID: GPU-d0cd36b5-9221-facd-203c-b2342b207439-_-0
I0220 10:16:27.480190 1 nvidia.go:79] # Add last device ID: GPU-d0cd36b5-9221-facd-203c-b2342b207439-_-10
I0220 10:16:27.501184 1 nvidia.go:64] Deivce GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2's Path is /dev/nvidia4
I0220 10:16:27.501203 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.501209 1 nvidia.go:76] # Add first device ID: GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2-_-0
I0220 10:16:27.501216 1 nvidia.go:79] # Add last device ID: GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2-_-10
I0220 10:16:27.524208 1 nvidia.go:64] Deivce GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd's Path is /dev/nvidia5
I0220 10:16:27.524226 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.524231 1 nvidia.go:76] # Add first device ID: GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd-_-0
I0220 10:16:27.524243 1 nvidia.go:79] # Add last device ID: GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd-_-10
I0220 10:16:27.547600 1 nvidia.go:64] Deivce GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1's Path is /dev/nvidia6
I0220 10:16:27.547627 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.547635 1 nvidia.go:76] # Add first device ID: GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1-_-0
I0220 10:16:27.547653 1 nvidia.go:79] # Add last device ID: GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1-_-10
I0220 10:16:27.573674 1 nvidia.go:64] Deivce GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9's Path is /dev/nvidia7
I0220 10:16:27.573696 1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.573704 1 nvidia.go:76] # Add first device ID: GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9-_-0
I0220 10:16:27.573718 1 nvidia.go:79] # Add last device ID: GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9-_-10
I0220 10:16:27.573736 1 server.go:43] Device Map: map[GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd:5 GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1:6 GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9:7 GPU-5f44aae0-ca45-9038-5202-a033fa4f471a:0 GPU-28071eed-0993-c165-e123-ea818a546f14:1 GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c:2 GPU-d0cd36b5-9221-facd-203c-b2342b207439:3 GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2:4]
I0220 10:16:27.573807 1 server.go:44] Device List: [GPU-5f44aae0-ca45-9038-5202-a033fa4f471a GPU-28071eed-0993-c165-e123-ea818a546f14 GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c GPU-d0cd36b5-9221-facd-203c-b2342b207439 GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2 GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1 GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9]
I0220 10:16:27.592888 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0220 10:16:27.593476 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0220 10:16:27.594247 1 server.go:230] Registered device plugin with Kubelet
I0220 16:33:58.048842 1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I0220 16:33:58.048863 1 allocate.go:57] RequestPodGPUs: 2
I0220 16:33:58.048868 1 allocate.go:61] checking...
I0220 16:33:58.063205 1 podmanager.go:112] all pod list [{{ } {binpack-2-54df84c8d7-nknfx binpack-2-54df84c8d7- kong /api/v1/namespaces/kong/pods/binpack-2-54df84c8d7-nknfx ea316ce3-fd64-413e-b5fe-c06baff6444b 366992 0 2020-02-20 16:32:54 +0000 UTC <nil> <nil> map[app:binpack-2 pod-template-hash:54df84c8d7] map[] [{apps/v1 ReplicaSet binpack-2-54df84c8d7 a0faf71e-cd14-4614-a9f4-d87a236badd1 0xc4203bfe6a 0xc4203bfe6b}] nil [] } {[{default-token-kj8jt {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-kj8jt,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{binpack-2 cheyang/gpu-player:v2 [] [] [] [] [] {map[aliyun.com/gpu-mem:{{2 0} {<nil>} 2 DecimalSI}] map[aliyun.com/gpu-mem:{{2 0} {<nil>} 2 DecimalSI}]} [{default-token-kj8jt true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc4203bff08 <nil> ClusterFirst map[] default default <nil> 192.168.3.4 false false false <nil> &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} [] nil default-scheduler [{node.kubernetes.io/not-ready Exists NoExecute 0xc4203bffa0} {node.kubernetes.io/unreachable Exists NoExecute 0xc4203bffc0}] [] 0xc4203bffd0 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-02-20 16:32:54 +0000 UTC }] <nil> [] [] BestEffort}}]
I0220 16:33:58.063418 1 podmanager.go:123] list pod binpack-2-54df84c8d7-nknfx in ns kong in node 192.168.3.4 and status is Pending
I0220 16:33:58.063430 1 podutils.go:81] No assume timestamp for pod binpack-2-54df84c8d7-nknfx in namespace kong, so it's not GPUSharedAssumed assumed pod.
W0220 16:33:58.063439 1 allocate.go:152] invalid allocation requst: request GPU memory 2 can't be satisfied.
Hi guys, I also have this issue when I use demo1:
Warning Failed 3m41s (x5 over 5m20s) kubelet, k8s-master Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-2MiB-to-run\\\\n\\\"\"": unknown
My Docker version is 19.3.3 and my NVIDIA driver version is 418.116.00. When I run kubectl-inspect-gpushare it shows:
NAME        IPADDRESS      GPU0(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
k8s-master  192.168.1.103  0/7                    2                   2/7
Allocated/Total GPU Memory In Cluster: 2/7 (28%)
The gpushare-device-plugin log is the same as above. Is there any solution you can share with me? Thanks.
You can try adding an env entry to the container, like this:
containers:
- name: cuda
image: nvidia/cuda:latest
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
resources:
limits:
# GiB
aliyun.com/gpu-mem: 1
This fixed a similar error for me.
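For completeness, a self-contained Pod sketch built from that fragment is below (the pod name and the sleep command are placeholders I added; note that later comments in this thread argue NVIDIA_VISIBLE_DEVICES=all breaks GPU isolation, so treat this only as a diagnostic workaround):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-gpu-mem-test           # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:latest
    command: ["sleep", "3600"]      # keep the container alive for inspection
    env:
    - name: NVIDIA_VISIBLE_DEVICES  # the workaround discussed in this thread
      value: "all"
    resources:
      limits:
        aliyun.com/gpu-mem: 1       # GiB
EOF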
Fixed my issue by re-creating /etc/kubernetes/manifests/kube-scheduler.yaml
to make the static scheduler pod use the correct configuration.
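For reference, the piece the static scheduler pod has to pick up is the scheduler policy that registers the gpushare extender. The sketch below reflects my recollection of the install guide's Policy-based setup used on these Kubernetes versions; the port, path, and file location are assumptions to verify against the official guide:
# Assumed location and contents; check the gpushare-scheduler-extender install guide before use.
cat > /etc/kubernetes/scheduler-policy-config.json <<'EOF'
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
      ],
      "ignorable": false
    }
  ]
}
EOF
# The edit to /etc/kubernetes/manifests/kube-scheduler.yaml then adds (roughly)
#   --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
# plus a hostPath volume mount so the file is visible inside the scheduler pod.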
I tried what @Carlosnight says, but when I run kubectl inspect gpushare I see a new column called PENDING(Allocated), and it seems that GPU isolation did not happen.
I checked the value of ALIYUN_COM_GPU_MEM_IDX inside my pod and it is -1.
P.S.: I have a warning in gpushare-schd-extender that says: Pod gpu-pod in ns production is not set the GPU ID -1 in node xxxxxxxxx
P.S.: The DaemonSet log shows that it found the GPUs but repeatedly says it cannot assign 3 GiB of memory to the pod.
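A quick way to see what was actually bound is to dump the GPU-related environment variables inside the pod (the pod name and namespace here are the ones from the warning above):
# ALIYUN_COM_GPU_MEM_IDX is the index of the bound GPU; -1 means no card was actually bound.
kubectl exec -n production gpu-pod -- env | grep -E 'ALIYUN_COM_GPU|NVIDIA_VISIBLE_DEVICES'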
Same issue here, any feedback on this?
@Svendegroote91 in my case, it was a wrong configuration in kube-scheduler.yaml. I recommend checking it and reading the installation guide again.
@Mhs-220 I attached the kube-scheduler.yaml.zip from my controller node.
I have 3 controller nodes but only did the kube-scheduler update on the controller node (node "ctr-1" in my case) through which I am using the KubeAPI (I have no HA on top of the controller nodes at the moment). That should be sufficient no? You can see that the kube-scheduler restarted after updating the file:
Maybe it helps if I share the logs of the gpushare-schd-extender together with the logs of the actual container:
What strikes me is that the pod first goes into the "Pending" state and subsequently to "Running", but the status in kubectl inspect gpushare is not moving.
Can you elaborate what exactly you forgot to apply to the kube-scheduler.yaml file or point to the mistake in my attached file? Your help would be appreciated a lot!
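One way to double-check whether the running scheduler actually picked up the modified manifest is to look for the policy flag on the live process. The pod name follows the static-pod convention kube-scheduler-<node>, so kube-scheduler-ctr-1 below is an assumption based on the node name mentioned above, and the flag name assumes the Policy-based setup from the install guide:
# Does the static scheduler pod carry the extra flag?
kubectl -n kube-system get pod kube-scheduler-ctr-1 -o yaml | grep policy-config-file
# Or directly on the controller node:
ps aux | grep kube-scheduler | grep policy-config-file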
I solved it - I had to update all my master nodes with the instructions from the installation guide.
Can somebody explain why the environment variable NVIDIA_VISIBLE_DEVICES fixes this and why it is needed in the manifest file?
The error is happening again after updating Kubernetes to 1.18. Any idea? @Svendegroote91 what is your cluster version?
@Mhs-220 my cluster version is v1.15.11 (because I am using Kubeflow on top of Kubernetes and v1.15 is fully supported)
In my case, it does not work with v1.17.4, but works well with 1.15.11. Thanks to @Mhs-220 and @Svendegroote91.
I do not know why one version works and the other does not.
v1.16.15 has the same problem. It seems v1.16 is a version where the API changed a lot; I have seen deprecated "pre-1.16" API versions mentioned in other projects, like shown here. Maybe @cheyang can help us out?
In our case, adding the NVIDIA_VISIBLE_DEVICES=all env setting is not a reasonable solution to this problem. Adding it causes the GPU card the container actually uses to no longer match the GPU card the scheduler assigned to it, which leads to other issues. This can be verified with k exec [pod] -- env on a normally scheduled pod: you will find an entry like NVIDIA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx even though you did not specify one, which is how the scheduler controls which card the container uses. With =all set, nvidia-smi also shows that the card actually in use does not match the assigned one.
The cause of our situation was that the container was using more GPU memory than we gave it (or some processes not managed by k8s were occupying memory on the card). Although the scheduler thought there was enough memory left (and therefore scheduled the pod onto that card), there was actually not enough memory when the container ran.
In the end, we made sure every container stays strictly below the GPU memory limit we assign to it, and moved all processes not managed by k8s to other machines (or out of the cluster).
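To check for this kind of mismatch, or for stray memory usage outside k8s, something along these lines can help (the pod name is a placeholder):
# Which card did the device plugin actually hand to the pod?
kubectl exec my-gpu-pod -- env | grep NVIDIA_VISIBLE_DEVICES
# On the node: per-card memory usage, and every process (k8s-managed or not) holding GPU memory
nvidia-smi --query-gpu=index,uuid,memory.used,memory.total --format=csv
nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv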
Versions:
- docker 19.03.11
- kubernetes v1.19.8
Procedure:
containers:
- name: cuda
image: nvidia/cuda:latest
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
resources:
limits:
# GiB
aliyun.com/gpu-mem: 1
- After adding the environment variable NVIDIA_VISIBLE_DEVICES=all, my application can start normally, but the GPU memory reported by the nvidia-smi command does not look right, or something is wrong somewhere; for example, it shows "No running processes found":
[root@k8s243 models]# nvidia-smi
Thu Sep 9 17:21:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:00:10.0 Off | 0 |
| N/A 31C P8 6W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU Memory(GiB)
k8s243 10.3.171.243 2/7 2/7
k8s25 10.3.144.25 0/14 0/14
------------------------------------------------
Allocated/Total GPU Memory In Cluster:
2/21 (9%)
I encountered the same problem after installing according to the instructions. Later, I found that the other master had not been adjusted according to the instructions. After I modified the scheduler component configuration on both masters, everything worked normally.
Let me share my solution for reference. My problem was in how I edited kube-scheduler.yaml.
The wrong way I did it at first:
- cd /etc/kubernetes/manifests
- cp kube-scheduler.yaml kube-scheduler.yaml.backup
- edit kube-scheduler.yaml directly, following the installation guide
The problem is in step 2: /etc/kubernetes/manifests now contains two files, kube-scheduler.yaml and kube-scheduler.yaml.backup. Kubernetes loads all of them, and presumably the latter overrides the former, so the configuration change never takes effect.
The correct way (see the consolidated commands after this list):
- cd /etc/kubernetes
- mv manifests/kube-scheduler.yaml . to move kube-scheduler.yaml out of the manifests directory first.
- cp kube-scheduler.yaml kube-scheduler.yaml.backup; skip this step if you do not want a backup.
- edit kube-scheduler.yaml directly, following the installation guide.
- mv kube-scheduler.yaml manifests to put the edited file back into the manifests folder.
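Put together as one sequence, and assuming a kubeadm-style layout where the kubelet treats every file under /etc/kubernetes/manifests as a static pod manifest, the procedure looks like this:
cd /etc/kubernetes
# Move the manifest out of the watched directory before touching it
mv manifests/kube-scheduler.yaml .
# Optional backup, kept outside /etc/kubernetes/manifests on purpose
cp kube-scheduler.yaml kube-scheduler.yaml.backup
# ... edit kube-scheduler.yaml here, following the installation guide ...
# Put the edited manifest back; the kubelet will restart kube-scheduler with it
mv kube-scheduler.yaml manifests/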
Can somebody explain why the environment variable NVIDIA_VISIBLE_DEVICES fixes this and why it is needed in the manifest file?
Seems like this is how it works: https://github.com/AliyunContainerService/gpushare-device-plugin/issues/55#issue-1439746016
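As far as I understand, the nvidia-container-runtime prestart hook reads NVIDIA_VISIBLE_DEVICES to decide which devices to expose to the container; the device plugin sets it to the UUID of the assigned card, while "all" exposes every card. As an illustration with plain nvidia-docker2 (the UUID is a placeholder):
# Restrict the container to one card by UUID, then list what it can see
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  nvidia/cuda:latest nvidia-smi -L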
May I ask which installation guide you followed?