gpu-manager
A single GPU card may be oversold during scheduling
Through ip:5678/metric, I read the number of containers on each GPU card and the amount of resources remaining on each card. I found that some cards are oversold.
container info on each GPU card:
gpu0 gpu-bpwsbofhlu-container
gpu0 gpu-khfzwabyui-container
gpu0 gpu-ldlxmvitlt-container
gpu0 gpu-wjjvqnhryg-container
gpu1 gpu-gobchykabh-container
gpu1 gpu-jeeqpuplap-container
gpu2 gpu-adxgmluwud-container
gpu2 gpu-xezknopszn-container
gpu2 gpu-yftfckmqsq-container
gpu3 gpu-drkofzmjzq-container
gpu3 gpu-uckikqcwfv-container
gpu4 gpu-fqzilpeybt-container
gpu4 gpu-rkqsipucpj-container
gpu5 gpu-ipfeqrjsob-container
resources remaining on each GPU card:
[-97.0, 3.0, -25.0, -24.0, 17.0, 31.0, 59, 59]
In this case, GPU0 has too many containers allocated, while GPU6 and GPU7 have none. Could you give me some advice? Thanks!
Kubelet doesn't allow overselling device resources. How did you calculate this?
I read the metrics from ip:5678. From container_gpu_memory_total I get the names of all containers on each GPU, then I sum container_request_gpu_memory over the containers on a single GPU to get the allocated value for that card.
I find that the allocated sum is greater than the total resources of a single GPU.
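The calculation described above can be sketched roughly as follows. This is only an illustration, not gpu-manager code: the label names `gpu` and `container_name` are assumptions (check the actual labels in the ip:5678 output and pass them in), and the parser handles only simple `name{labels} value` lines.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",other="value"} 12.5
SAMPLE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Yield (metric_name, labels, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group(2)))
            yield match.group(1), labels, float(match.group(3))


def allocated_per_gpu(text, gpu_label="gpu", container_label="container_name"):
    """Sum container_request_gpu_memory per GPU card.

    gpu_label and container_label are assumed label names, not confirmed
    against gpu-manager's real metric output.
    """
    container_gpus = defaultdict(set)  # container -> cards it appears on
    requests = {}                      # container -> requested memory
    for name, labels, value in parse(text):
        container = labels.get(container_label)
        if container is None:
            continue
        if name == "container_gpu_memory_total" and gpu_label in labels:
            container_gpus[container].add(labels[gpu_label])
        elif name == "container_request_gpu_memory":
            requests[container] = value

    totals = defaultdict(float)
    for container, gpus in container_gpus.items():
        for gpu in gpus:
            # A container seen on several cards is counted in full on each.
            totals[gpu] += requests.get(container, 0.0)
    return dict(totals)
```

Note that in this sketch a container that shows up on more than one card contributes its full request to every one of them, which would inflate the per-card sum, so it is worth checking whether any container name appears under more than one GPU.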
Kubelet doesn't allow overselling device resources. How did you calculate this?
I think "kubelet doesn't allow overselling device resources" applies at the k8s node level. In my case, the overselling happens on an individual GPU card; the k8s node's resources as a whole are not oversold.
gpu-manager rounds up any memory size smaller than 256MB. Kubelet doesn't allow overselling, and neither does gpu-manager.
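One way to read the rounding remark, as a minimal sketch: it assumes memory is accounted in 256MB steps and that any partial step is rounded up. The function name and the step-wise interpretation are assumptions for illustration, not gpu-manager's actual code.

```python
UNIT_MB = 256  # assumed accounting granularity, taken from the comment above


def round_up_to_unit(memory_mb: int, unit_mb: int = UNIT_MB) -> int:
    """Round a memory size in MB up to the next multiple of unit_mb."""
    if memory_mb < 0:
        raise ValueError("memory_mb must be non-negative")
    # Ceiling division in integer arithmetic, then scale back to MB.
    return -(-memory_mb // unit_mb) * unit_mb
```

Under this assumption, a value read from the metrics can be larger than the amount the container really uses, so per-card sums built from such values should be compared against totals expressed in the same units.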
I read the metrics from ip:5678. From container_gpu_memory_total I get the names of all containers on each GPU, then I sum container_request_gpu_memory over the containers on a single GPU to get the allocated value for that card. I find that the allocated sum is greater than the total resources of a single GPU.
Do you think this is a correct way to calculate the allocated tencent.com/vcuda-memory on one GPU?
In my case, the allocated tencent.com/vcuda-memory I calculate for one GPU is greater than the total allocatable tencent.com/vcuda-memory of that GPU.