GPUMounter
Cannot use GPUMounter on k8s
environment:
- k8s 1.16.15
- docker 20.10.10
Problem: Following QuickStart.md, I installed GPUMounter successfully in my k8s cluster. However, I have never been able to request remove gpu or add gpu successfully.
I pasted some logs from the gpu-mounter-master container:
remove gpu:
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:120 access remove gpu service
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:134 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:135 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:146 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd force: true
2022-02-18T03:44:55.188Z INFO GPUMounter-master/main.go:169 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.201Z ERROR GPUMounter-master/main.go:217 Invalid UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
add gpu:
2022-02-18T03:42:22.897Z INFO GPUMounter-master/main.go:25 access add gpu service
2022-02-18T03:42:22.898Z INFO GPUMounter-master/main.go:30 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default GPU Num: 4 Is entire mount: false
2022-02-18T03:42:22.902Z INFO GPUMounter-master/main.go:66 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:98 Failed to call add gpu service
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:99 rpc error: code = Unknown desc = FailedCreated
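The remove call fails with "Invalid UUIDs" and the add call with "FailedCreated". As a sanity check (not a GPUMounter command), the GPU UUIDs that actually exist on the node can be listed with nvidia-smi, run on dev06 itself or in any pod with GPU access:

```bash
# List the GPU index and UUID of every GPU visible on the node
nvidia-smi --query-gpu=index,uuid --format=csv,noheader

# Shorter form: name plus UUID per GPU
nvidia-smi -L
```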
There is no slave pod in my namespace gpu-pool.
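To confirm that, the gpu-pool namespace can be listed directly, e.g.:

```bash
# List any slave pods GPUMounter created in the gpu-pool namespace
kubectl get pods -n gpu-pool -o wide
```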
@liuweibin6566396837
Thanks for your issue.
Please show more relevant logs of gpu-mounter-worker (/etc/GPUMounter/log/GPUMounter-worker.log).
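For example, something like the following could pull that log from the worker pod on the affected node; the namespace of the gpu-mounter-workers DaemonSet below is an assumption, adjust it to wherever the workers were deployed:

```bash
# Find the GPUMounter worker pods and the nodes they run on
kubectl get pods -A -o wide | grep gpu-mounter-workers

# Dump the worker log from the pod on dev06 (namespace assumed; replace as needed)
kubectl exec -n kube-system gpu-mounter-workers-kwmsn -- \
  cat /etc/GPUMounter/log/GPUMounter-worker.log
```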
It seems like you edited the k8s version in this issue. What's your k8s version? In the current version, GPUMounter has a known bug on k8s v1.20+, mentioned in https://github.com/pokerfaceSad/GPUMounter/issues/19#issuecomment-1034134013.
> It seems like you edited the k8s version in this issue. What's your k8s version? In the current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).
Thanks for your reply. I fixed that problem earlier (just make sure the env NVIDIA_VISIBLE_DEVICES is set to "none" and the problem is solved; see the sketch below).
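For reference, a minimal sketch of that setting; placing it in the container spec of the target pod (the one GPUs get mounted into) is an assumption, and the container name is only illustrative:

```yaml
# Sketch: env entry in the target pod's container spec (placement assumed)
spec:
  containers:
    - name: jupyter-lab            # illustrative container name
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"            # keep the NVIDIA runtime from exposing GPUs implicitly
```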
> It seems like you edited the k8s version in this issue. What's your k8s version? In the current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).
Now I have hit a new bug in a cluster running k8s 1.20.11 and docker 20.10.10.
Bug: when I request addgpu, it returns "Add GPU Success", but there is no slave pod in gpu-pool. I found some unusual log entries in the worker's pod; some of the logs follow:
2022-03-15T13:12:43.240Z INFO collector/collector.go:136 GPU status update successfully
2022-03-15T13:12:46.402Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: base-0-slave-pod-595282 for Owner Pod: base-0
2022-03-15T13:12:46.403Z INFO allocator/allocator.go:238 Checking Pods: base-0-slave-pod-595282 state
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:252 Not Found....
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:277 Pods: base-0-slave-pod-595282 are running
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:84 Successfully create Slave Pod: base-0-slave-pod-595282, for Owner Pod: base-0
2022-03-15T13:12:50.450Z INFO collector/collector.go:91 Updating GPU status
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: xiaoxuan-fbdd-0 in Namespace shixiaoxuan
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia1 allocated to Pod: zwbgpu-pytorch-1-6-0 in Namespace zhouwenbiao
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia7 allocated to Pod: admet-predict-0 in Namespace liqinze
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia5 allocated to Pod: xiaoxuan-test-d2m-0 in Namespace shixiaoxuan
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia2 allocated to Pod: bf-dev-2-0 in Namespace baifang
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia3 allocated to Pod: bf-dev-2-0 in Namespace baifang
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia6 allocated to Pod: minisomdimsanbai-0 in Namespace yangdeai
2022-03-15T13:12:50.452Z INFO collector/collector.go:136 GPU status update successfully
2022-03-15T13:12:50.452Z INFO gpu-mount/server.go:97 Successfully mount all GPU to Pod: base-0 in Namespace: liuweibin
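One way to cross-check that log (it reports "Not Found...." and then claims the slave pod is running) is to look for the slave pod and for the device nodes inside the owner pod; the pod and namespace names below are taken from the log above:

```bash
# Check whether the slave pod the worker claims to have created actually exists
kubectl get pods -n gpu-pool | grep base-0-slave-pod

# Check whether any NVIDIA device node was actually mounted into the owner pod
kubectl exec -n liuweibin base-0 -- ls /dev | grep -i nvidia
```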
Thanks for your report. It seems that you have the unfixed issue mentioned in #19. GPUMounter cannot work well on k8s v1.20+ in the current version.