
A single GPU card may be oversold during scheduling

Lzhang-hub opened this issue · 5 comments

Through ip:5678/metric, I get the number of containers on each GPU card and the amount of resources remaining on each card, and I found an oversell situation.

Container info on each GPU card:

gpu0: gpu-bpwsbofhlu-container, gpu-khfzwabyui-container, gpu-ldlxmvitlt-container, gpu-wjjvqnhryg-container
gpu1: gpu-gobchykabh-container, gpu-jeeqpuplap-container
gpu2: gpu-adxgmluwud-container, gpu-xezknopszn-container, gpu-yftfckmqsq-container
gpu3: gpu-drkofzmjzq-container, gpu-uckikqcwfv-container
gpu4: gpu-fqzilpeybt-container, gpu-rkqsipucpj-container
gpu5: gpu-ipfeqrjsob-container

Resources remaining on each GPU card:

[-97.0, 3.0, -25.0, -24.0, 17.0, 31.0, 59, 59]

In this case, GPU0 is allocated too many containers, while GPU6 and GPU7 have no containers. Could you give me some advice? Thanks!

Lzhang-hub avatar Jan 06 '22 02:01 Lzhang-hub

Kubelet doesn't allow overselling device resources. How did you do your calculation?

mYmNeo avatar Jan 06 '22 02:01 mYmNeo

I read the metrics from ip:5678. Through container_gpu_memory_total I can get all container names on each GPU, and then summing container_request_gpu_memory for a single GPU gives the allocated value on each GPU card. I find the allocated sum is greater than the total resources of a single GPU.

Lzhang-hub avatar Jan 06 '22 06:01 Lzhang-hub
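For reference, here is a minimal sketch of the summation being described, not the poster's actual script: it scrapes the gpu-manager metrics endpoint and sums container_request_gpu_memory per card. The endpoint URL is a placeholder, and the gpu label name is an assumption; verify it against the label set your deployment actually exposes.

```python
# A sketch only. Assumptions: the endpoint serves Prometheus text format,
# and each sample carries a label naming the physical card (shown here
# as "gpu"); check your actual metric labels.
from collections import defaultdict

import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://NODE_IP:5678/metric"  # placeholder node address


def allocated_per_gpu(url: str = METRICS_URL) -> dict:
    """Sum container_request_gpu_memory grouped by the GPU label."""
    body = requests.get(url, timeout=5).text
    totals = defaultdict(float)
    for family in text_string_to_metric_families(body):
        if family.name == "container_request_gpu_memory":
            for sample in family.samples:
                totals[sample.labels.get("gpu", "unknown")] += sample.value
    return dict(totals)


if __name__ == "__main__":
    for gpu, value in sorted(allocated_per_gpu().items()):
        print(f"{gpu}: allocated={value}")
```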

Kubelet doesn't allow overselling device resources. How did you do your calculation?

I think "kubelet doesn't allow overselling device resources" applies to the k8s node, but in my case the oversell happens on a single GPU card; the node-level resources are not oversold.

Lzhang-hub avatar Jan 06 '22 06:01 Lzhang-hub

gpu-manager rounds up any memory size less than 256MB. Kubelet doesn't allow overselling, and gpu-manager won't either.

mYmNeo avatar Jan 06 '22 11:01 mYmNeo
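To make the rounding rule concrete: if a request below 256 MB is accounted as a full 256 MB (and, assuming here, every request is billed in 256 MB chunks), the accounted sum per card can legitimately exceed the sum of the raw requests. A toy illustration, not gpu-manager's actual code:

```python
CHUNK_MB = 256  # gpu-manager accounts memory in 256 MB units


def accounted_mb(request_mb: float) -> int:
    """Round a raw request up to the next 256 MB boundary (min one chunk)."""
    chunks = -(-int(request_mb) // CHUNK_MB)  # ceiling division
    return max(chunks, 1) * CHUNK_MB


raw = [100, 300, 260]                     # hypothetical raw requests in MB
print([accounted_mb(r) for r in raw])     # [256, 512, 512]
print(sum(raw))                           # 660
print(sum(accounted_mb(r) for r in raw))  # 1280
```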

I read the metrics from ip:5678. Through container_gpu_memory_total I can get all container names on each GPU, and then summing container_request_gpu_memory for a single GPU gives the allocated value on each GPU card. I find the allocated sum is greater than the total resources of a single GPU.

Do you think it is correct to calculate the allocated tencent.com/vcuda-memory on one GPU with this method?

In my case, the allocated tencent.com/vcuda-memory I calculate for one GPU is greater than the total allocatable tencent.com/vcuda-memory of that GPU.

Lzhang-hub avatar Jan 06 '22 12:01 Lzhang-hub
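Putting the thread's numbers together: the oversell check reduces to capacity minus summed requests per card. The values below are hypothetical, back-derived from the remaining-resources list in the issue under the assumption that every card exposes the same 59-unit capacity:

```python
def remaining_per_gpu(capacity: dict, allocated: dict) -> dict:
    """capacity - allocated per card; negative values indicate oversell."""
    return {gpu: capacity[gpu] - allocated.get(gpu, 0.0) for gpu in capacity}


# Hypothetical inputs chosen so the arithmetic reproduces the list
# [-97.0, 3.0, -25.0, -24.0, 17.0, 31.0, 59, 59] reported above.
capacity = {f"gpu{i}": 59.0 for i in range(8)}
allocated = {"gpu0": 156.0, "gpu1": 56.0, "gpu2": 84.0,
             "gpu3": 83.0, "gpu4": 42.0, "gpu5": 28.0}

for gpu, rest in sorted(remaining_per_gpu(capacity, allocated).items()):
    print(f"{gpu}: remaining={rest}" + ("  <-- oversold" if rest < 0 else ""))
```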