
Cannot use GPUMounter on k8s

Open Crazybean-lwb opened this issue 3 years ago • 6 comments

environment:

  • k8s 1.16.15
  • docker 20.10.10

Problem: following QuickStart.md, I installed GPUMounter successfully in my k8s cluster. However, neither the remove GPU request nor the add GPU request ever succeeds.

I pasted some logs from the gpu-mounter-master container:

remove gpu
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:120 access remove gpu service
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:134 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:135 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:146 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd force: true
2022-02-18T03:44:55.188Z INFO GPUMounter-master/main.go:169 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.201Z ERROR GPUMounter-master/main.go:217 Invalid UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd

add gpu
2022-02-18T03:42:22.897Z INFO GPUMounter-master/main.go:25 access add gpu service
2022-02-18T03:42:22.898Z INFO GPUMounter-master/main.go:30 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default GPU Num: 4 Is entire mount: false
2022-02-18T03:42:22.902Z INFO GPUMounter-master/main.go:66 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:98 Failed to call add gpu service
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:99 rpc error: code = Unknown desc = FailedCreated
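
For context, I made these requests against the master's HTTP service, roughly like the sketch below. The endpoint paths and port here are only reconstructions from the logged parameters (GPU Num, Is entire mount, UUIDs, force), not the authoritative API; see QuickStart.md for the exact URLs.

```shell
# Hypothetical calls to the GPUMounter master service; paths and port are illustrative,
# the real endpoints are documented in the project's QuickStart.md.
MASTER=http://<gpu-mounter-master-service-ip>:<port>

# add 4 GPUs to the pod (matches "GPU Num: 4  Is entire mount: false" in the log)
curl "$MASTER/addgpu/namespace/default/pod/jupyter-lab-54d76f5d58-rlklh/gpu/4/isEntireMount/false"

# remove a GPU by UUID (matches "UUIDs: GPU-...  force: true" in the log)
curl "$MASTER/removegpu/namespace/default/pod/jupyter-lab-54d76f5d58-rlklh/uuid/GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd/force/true"
```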

Crazybean-lwb · Feb 18 '22 04:02

There is also no slave pod in my namespace gpu-pool.

Crazybean-lwb · Feb 18 '22 06:02

@liuweibin6566396837 Thanks for your issue. Please show more of the relevant logs from gpu-mounter-worker (/etc/GPUMounter/log/GPUMounter-worker.log).
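
One way to dump that file is to exec into the worker pod on the target pod's node (the worker pod names come from your master log; the namespace below is a placeholder, use whatever namespace the worker DaemonSet was deployed into):

```shell
# Find the GPUMounter worker pods and the namespace they run in
kubectl get pods -A -o wide | grep gpu-mounter-workers

# Dump the worker log file; gpu-mounter-workers-kwmsn is the worker on the
# target pod's node according to the master log above
kubectl exec -n <worker-namespace> gpu-mounter-workers-kwmsn -- \
  cat /etc/GPUMounter/log/GPUMounter-worker.log
```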

pokerfaceSad · Feb 18 '22 09:02

It seems like you edited the k8s version in this issue. What is your k8s version? In its current version, GPUMounter has a known bug on k8s v1.20+, mentioned in https://github.com/pokerfaceSad/GPUMounter/issues/19#issuecomment-1034134013.

pokerfaceSad · Feb 18 '22 09:02

> It seems like you edited the k8s version in this issue. What is your k8s version? In its current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).

Thanks for your reply. I had already fixed that problem earlier: just make sure the pod has the env var NVIDIA_VISIBLE_DEVICES set to "none" and the problem is solved (see the sketch below).
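
A minimal sketch of where that env var goes, assuming it is set on the target pod's container (the pod name and image here are placeholders, only the env entry matters):

```yaml
# Placeholder pod spec; the relevant part is the NVIDIA_VISIBLE_DEVICES env var
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-lab
spec:
  containers:
    - name: jupyter-lab
      image: <your-image>
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"
```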

Crazybean-lwb · Mar 16 '22 09:03

> It seems like you edited the k8s version in this issue. What is your k8s version? In its current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).

Now I have hit a new bug, in a cluster running k8s 1.20.11 and docker 20.10.10.

Bug: when I request addgpu, it returns "Add GPU Success", but there is no slave pod in gpu-pool. I found some unusual entries in the worker pod's log; some of them follow:

2022-03-15T13:12:43.240Z INFO collector/collector.go:136 GPU status update successfully
2022-03-15T13:12:46.402Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: base-0-slave-pod-595282 for Owner Pod: base-0
2022-03-15T13:12:46.403Z INFO allocator/allocator.go:238 Checking Pods: base-0-slave-pod-595282 state
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:252 Not Found....
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:277 Pods: base-0-slave-pod-595282 are running
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:84 Successfully create Slave Pod: base-0-slave-pod-595282, for Owner Pod: base-0
2022-03-15T13:12:50.450Z INFO collector/collector.go:91 Updating GPU status
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: xiaoxuan-fbdd-0 in Namespace shixiaoxuan
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia1 allocated to Pod: zwbgpu-pytorch-1-6-0 in Namespace zhouwenbiao
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia7 allocated to Pod: admet-predict-0 in Namespace liqinze
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia5 allocated to Pod: xiaoxuan-test-d2m-0 in Namespace shixiaoxuan
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia2 allocated to Pod: bf-dev-2-0 in Namespace baifang
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia3 allocated to Pod: bf-dev-2-0 in Namespace baifang
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia6 allocated to Pod: minisomdimsanbai-0 in Namespace yangdeai
2022-03-15T13:12:50.452Z INFO collector/collector.go:136 GPU status update successfully
2022-03-15T13:12:50.452Z INFO gpu-mount/server.go:97 Successfully mount all GPU to Pod: base-0 in Namespace: liuweibin
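
As a quick check (plain kubectl commands, using the names from the log above), I confirm whether the slave pod was actually created and look at its events:

```shell
# List the slave pods GPUMounter should have created in the gpu-pool namespace
kubectl get pods -n gpu-pool

# If the slave pod exists but is not Running, its events usually explain why
# (base-0-slave-pod-595282 is the pod name from the worker log above)
kubectl describe pod base-0-slave-pod-595282 -n gpu-pool
```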

Crazybean-lwb · Mar 16 '22 09:03

Thanks for your report. It seems that you are hitting the unfixed issue mentioned in #19. GPUMounter cannot work well on k8s v1.20+ in its current version.

pokerfaceSad · Mar 18 '22 15:03