gpushare-device-plugin
nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run
Version info:
k8s: 1.17
gpushare-device-plugin: v2-1.11-aff8a23
nvidia-smi: 440.36
kubectl describe pod <pod-name> -n zhaogaolong
Pod error events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned zhaogaolong/gpu-demo-gpushare-659fd6cbb7-6fc8v to gpu-node
Normal Pulling 32s (x4 over 70s) kubelet, gpu-node Pulling image "hub.xxxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
Normal Pulled 32s (x4 over 70s) kubelet, gpu-node Successfully pulled image "hub.xxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
Normal Created 31s (x4 over 70s) kubelet, gpu-node Created container gpu
Warning Failed 31s (x4 over 70s) kubelet, gpu-node Error: failed to start container "gpu": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run\\\\n\\\"\"": unknown
Warning BackOff 10s (x5 over 68s) kubelet, gpu-node Back-off restarting failed container
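For context, the 256MiB in the fake device id matches the pod's GPU-memory request. Below is a minimal sketch of what the workload spec looks like (the deployment name, namespace, and image are taken from the events above; the request value and its unit depend on your own spec and on how the device plugin is configured to report GPU memory):

```yaml
# Minimal sketch of a gpushare workload; names and image are assumptions based on the events above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo-gpushare
  namespace: zhaogaolong
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo-gpushare
  template:
    metadata:
      labels:
        app: gpu-demo-gpushare
    spec:
      containers:
      - name: gpu
        image: hub.xxxx.com/zhaogaolong/gpu-demo.build.build:<tag>  # full tag as shown in the events above
        resources:
          limits:
            # GPU memory is handled by the gpushare scheduler extender + device plugin,
            # not by the default scheduler; the unit depends on the plugin configuration.
            aliyun.com/gpu-mem: 256
```

As far as I can tell, the fake device id is what the device plugin falls back to when it cannot find the GPU assignment that the gpushare scheduler extender is supposed to record for the pod, which is why the fix suggested later in this thread is about the scheduler configuration.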
Same issue:
https://github.com/NVIDIA/nvidia-docker/issues/1042
@cheyang
Has anybody fixed this? I'm hitting the same problem here. https://github.com/AliyunContainerService/gpushare-scheduler-extender/issues/120#issue-665519945
I encountered the same issue today. Can anybody help please?
Update your scheduler config to include the gpushare-sch-extender and restart the scheduler; that should fix it. @vio-f
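For anyone else hitting this, here is a sketch of the scheduler Policy file that wires in the extender, along the lines of the gpushare-scheduler-extender setup guide. The urlPrefix and port are assumptions; point it at wherever your extender service is actually reachable:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
      ],
      "ignorable": false
    }
  ]
}
```

On k8s 1.17 this can be passed to kube-scheduler via --policy-config-file (or a policy ConfigMap), followed by a scheduler restart. The important part is that bindVerb is set, so the extender gets to record the per-pod GPU assignment that the device plugin later reads; without it you end up with the no-gpu-has-…-to-run error above.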