GPUMounter
Will cgroup v2 be supported in the future?
I tested this successfully with cgroup v1, but it fails with cgroup v2.
Here are the logs:
2025-01-13T03:41:56.696Z INFO GPUMounter-worker/main.go:15 Service Starting...
2025-01-13T03:41:56.696Z INFO gpu-mount/server.go:22 Creating gpu mounter
2025-01-13T03:41:56.696Z INFO allocator/allocator.go:28 Creating gpu allocator
2025-01-13T03:41:56.696Z INFO collector/collector.go:24 Creating gpu collector
2025-01-13T03:41:56.696Z INFO collector/collector.go:42 Start get gpu info
2025-01-13T03:41:56.704Z INFO collector/collector.go:53 GPU Num: 1
2025-01-13T03:41:56.710Z INFO collector/collector.go:91 Updating GPU status
2025-01-13T03:41:56.711Z INFO collector/collector.go:136 GPU status update successfully
2025-01-13T03:41:56.711Z INFO collector/collector.go:36 Successfully update gpu status
2025-01-13T03:41:56.711Z INFO allocator/allocator.go:35 Successfully created gpu collector
2025-01-13T03:41:56.711Z INFO gpu-mount/server.go:29 Successfully created gpu allocator
2025-01-13T03:41:56.711Z INFO GPUMounter-worker/main.go:22 Successfully created gpu mounter
2025-01-13T03:41:58.732Z INFO gpu-mount/server.go:35 AddGPU Service Called
2025-01-13T03:41:58.732Z INFO gpu-mount/server.go:36 request: pod_name:"owner-pod" namespace:"default" gpu_num:1
2025-01-13T03:41:58.750Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2025-01-13T03:41:58.750Z INFO allocator/allocator.go:159 Get pod default/owner-pod mount type
2025-01-13T03:41:58.750Z INFO collector/collector.go:91 Updating GPU status
2025-01-13T03:41:58.750Z INFO collector/collector.go:136 GPU status update successfully
2025-01-13T03:41:58.758Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: owner-pod-slave-pod-40a529 for Owner Pod: owner-pod
2025-01-13T03:41:58.758Z INFO allocator/allocator.go:239 Checking Pods: owner-pod-slave-pod-40a529 state
2025-01-13T03:41:58.760Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:41:58.762Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:41:58.763Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:41:58.765Z INFO allocator/allocator.go:265 Pod: owner-pod-slave-pod-40a529 creating
2025-01-13T03:42:00.142Z INFO allocator/allocator.go:278 Pods: owner-pod-slave-pod-40a529 are running
2025-01-13T03:42:00.142Z INFO allocator/allocator.go:84 Successfully create Slave Pod: owner-pod-slave-pod-40a529, for Owner Pod: owner-pod
2025-01-13T03:42:00.142Z INFO collector/collector.go:91 Updating GPU status
2025-01-13T03:42:00.143Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: owner-pod-slave-pod-40a529 in Namespace default
2025-01-13T03:42:00.143Z INFO collector/collector.go:136 GPU status update successfully
2025-01-13T03:42:00.143Z INFO gpu-mount/server.go:81 Start mounting, Total: 1 Current: 1
2025-01-13T03:42:00.143Z INFO util/util.go:19 Start mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-cf2f070a-ff5a-ee2b-16ac-047f7e9c16bb","State":"GPU_ALLOCATED_STATE","PodName":"owner-pod-slave-pod-40a529","Namespace":"default"} to Pod: owner-pod
2025-01-13T03:42:00.143Z INFO util/util.go:24 Pod :owner-pod container ID: a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740
2025-01-13T03:42:00.143Z INFO util/util.go:35 Successfully get cgroup path: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4fea5b3d_b5ff_4e7f_a0f6_31a9ba061196.slice/docker-a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740.scope for Pod: owner-pod
2025-01-13T03:42:00.145Z ERROR cgroup/cgroup.go:148 Exec "echo 'c 195:0 rw' > /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4fea5b3d_b5ff_4e7f_a0f6_31a9ba061196.slice/docker-a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740.scope/devices.allow" failed
2025-01-13T03:42:00.145Z ERROR cgroup/cgroup.go:149 Output: sh: 1: cannot create /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4fea5b3d_b5ff_4e7f_a0f6_31a9ba061196.slice/docker-a893d886e17a63b2b056cef9766df9ae2b0a1f130a2c6952ed029a0fe5b1b740.scope/devices.allow: Directory nonexistent
2025-01-13T03:42:00.145Z ERROR cgroup/cgroup.go:150 exit status 2
2025-01-13T03:42:00.145Z ERROR util/util.go:38 Add GPU {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-cf2f070a-ff5a-ee2b-16ac-047f7e9c16bb","State":"GPU_ALLOCATED_STATE","PodName":"owner-pod-slave-pod-40a529","Namespace":"default"} failed
2025-01-13T03:42:00.145Z ERROR gpu-mount/server.go:84 Mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-cf2f070a-ff5a-ee2b-16ac-047f7e9c16bb","State":"GPU_ALLOCATED_STATE","PodName":"owner-pod-slave-pod-40a529","Namespace":"default"} to Pod: owner-pod in Namespace: default failed
2025-01-13T03:42:00.145Z ERROR gpu-mount/server.go:85 exit status 2
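If I read the error correctly, the worker grants device access by writing a v1 rule ("c 195:0 rw") into the container's devices.allow file, and that path simply does not exist on a node running the unified hierarchy. For reference, here is a small Go sketch of how the cgroup mode of a node can be detected; cgroupMode is just a helper name I made up for illustration, not anything from GPUMounter:

package main

import (
	"fmt"
	"os"
)

// cgroupMode reports which cgroup hierarchy the node appears to use.
// On cgroup v2, /sys/fs/cgroup/cgroup.controllers exists at the cgroup root;
// on cgroup v1, per-controller directories such as /sys/fs/cgroup/devices
// exist instead.
func cgroupMode() string {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		return "v2 (unified)"
	}
	if _, err := os.Stat("/sys/fs/cgroup/devices"); err == nil {
		return "v1 (legacy)"
	}
	return "unknown/hybrid"
}

func main() {
	fmt.Println("cgroup mode:", cgroupMode())
}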
I checked the filesystem on my node: there is no /sys/fs/cgroup/devices directory at all.
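From what I understand, cgroup v2 dropped the devices controller entirely, so there is no devices.allow to write to; device access on the unified hierarchy is enforced by an eBPF program of type BPF_PROG_TYPE_CGROUP_DEVICE attached to the container's cgroup. Below is only a rough sketch of that mechanism (not GPUMounter code): it uses the cilium/ebpf library and a placeholder cgroup path, loads a trivial allow-all device program, and attaches it to the cgroup.

package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Placeholder cgroup v2 path of the target container (assumption, not a real path).
	const cgroupPath = "/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/example.scope"

	// Minimal device program that allows every access (always returns 1).
	// A real implementation would inspect the bpf_cgroup_dev_ctx fields
	// (access type, major, minor) and only permit the requested rule,
	// e.g. c 195:0 rw for /dev/nvidia0.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type: ebpf.CGroupDevice,
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 1), // verdict: allow
			asm.Return(),
		},
		License: "GPL",
	})
	if err != nil {
		log.Fatalf("load device program: %v", err)
	}
	defer prog.Close()

	// Attach the program to the container's cgroup. Closing the link
	// detaches it again, so a real tool would keep it open or pin it.
	l, err := link.AttachCgroup(link.CgroupOptions{
		Path:    cgroupPath,
		Attach:  ebpf.AttachCGroupDevice,
		Program: prog,
	})
	if err != nil {
		log.Fatalf("attach to cgroup: %v", err)
	}
	defer l.Close()
}

In practice the container runtime already attaches its own device program to the cgroup, so I suppose real v2 support would have to extend or replace that program (for example, attaching with BPF_F_ALLOW_MULTI) rather than just echoing into a file.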