
A single GPU card may be oversold during scheduling

Lzhang-hub opened this issue · 5 comments

Through ip:5678/metric, I get the number of containers on each GPU card and the amount of resources remaining on each card, and I found an oversell situation.

Container info on each GPU card:

gpu0: gpu-bpwsbofhlu-container, gpu-khfzwabyui-container, gpu-ldlxmvitlt-container, gpu-wjjvqnhryg-container
gpu1: gpu-gobchykabh-container, gpu-jeeqpuplap-container
gpu2: gpu-adxgmluwud-container, gpu-xezknopszn-container, gpu-yftfckmqsq-container
gpu3: gpu-drkofzmjzq-container, gpu-uckikqcwfv-container
gpu4: gpu-fqzilpeybt-container, gpu-rkqsipucpj-container
gpu5: gpu-ipfeqrjsob-container

Resources remaining on each GPU card:

[-97.0, 3.0, -25.0, -24.0, 17.0, 31.0, 59, 59]

In this case, GPU0 is allocated too many containers, while GPU6 and GPU7 have no containers. Could you give me some advice? Thanks!

Lzhang-hub avatar Jan 06 '22 02:01 Lzhang-hub

Kubelet doesn't allow overselling device resources. How did you do your calculation?

mYmNeo avatar Jan 06 '22 02:01 mYmNeo

I read the metrics from ip:5678. Through container_gpu_memory_total I can get all container names on each GPU, and then summing container_request_gpu_memory for a single GPU gives the allocated value on each GPU card. I find the allocated sum is greater than the total resources of a single GPU.

Lzhang-hub avatar Jan 06 '22 06:01 Lzhang-hub
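For reference, here is a minimal sketch of the summation being described, not the poster's actual script: it scrapes the gpu-manager metrics endpoint and sums container_request_gpu_memory per card. The endpoint URL is a placeholder, and the gpu label name is an assumption; verify it against the label set your deployment actually exposes.

```python
# A sketch only. Assumptions: the endpoint serves Prometheus text format,
# and each sample carries a label naming the physical card (shown here
# as "gpu"); check your actual metric labels.
from collections import defaultdict

import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://NODE_IP:5678/metric"  # placeholder node address


def allocated_per_gpu(url: str = METRICS_URL) -> dict:
    """Sum container_request_gpu_memory grouped by the GPU label."""
    body = requests.get(url, timeout=5).text
    totals = defaultdict(float)
    for family in text_string_to_metric_families(body):
        if family.name == "container_request_gpu_memory":
            for sample in family.samples:
                totals[sample.labels.get("gpu", "unknown")] += sample.value
    return dict(totals)


if __name__ == "__main__":
    for gpu, value in sorted(allocated_per_gpu().items()):
        print(f"{gpu}: allocated={value}")
```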

Kubelet doesn't allow overselling device resources. How did you do your calculation?

I think "kubelet doesn't allow overselling device resources" applies to the k8s node, but in my case the oversell happens on a single GPU card; the node-level resources are not oversold.

Lzhang-hub avatar Jan 06 '22 06:01 Lzhang-hub

gpu-manager rounds up any memory size less than 256MB. Kubelet doesn't allow overselling, and gpu-manager won't either.

mYmNeo avatar Jan 06 '22 11:01 mYmNeo
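To make the rounding rule concrete: if a request below 256 MB is accounted as a full 256 MB (and, assuming here, every request is billed in 256 MB chunks), the accounted sum per card can legitimately exceed the sum of the raw requests. A toy illustration, not gpu-manager's actual code:

```python
CHUNK_MB = 256  # gpu-manager accounts memory in 256 MB units


def accounted_mb(request_mb: float) -> int:
    """Round a raw request up to the next 256 MB boundary (min one chunk)."""
    chunks = -(-int(request_mb) // CHUNK_MB)  # ceiling division
    return max(chunks, 1) * CHUNK_MB


raw = [100, 300, 260]                     # hypothetical raw requests in MB
print([accounted_mb(r) for r in raw])     # [256, 512, 512]
print(sum(raw))                           # 660
print(sum(accounted_mb(r) for r in raw))  # 1280
```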

I read the metrics from ip:5678. Through container_gpu_memory_total I can get all container names on each GPU, and then summing container_request_gpu_memory for a single GPU gives the allocated value on each GPU card. I find the allocated sum is greater than the total resources of a single GPU.

Do you think it is correct to calculate the allocated tencent.com/vcuda-memory on one GPU with this method?

In my case, the allocated tencent.com/vcuda-memory I calculate for one GPU is greater than the total allocatable tencent.com/vcuda-memory of that GPU.

Lzhang-hub avatar Jan 06 '22 12:01 Lzhang-hub
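Putting the thread's numbers together: the oversell check reduces to capacity minus summed requests per card. The values below are hypothetical, back-derived from the remaining-resources list in the issue under the assumption that every card exposes the same 59-unit capacity:

```python
def remaining_per_gpu(capacity: dict, allocated: dict) -> dict:
    """capacity - allocated per card; negative values indicate oversell."""
    return {gpu: capacity[gpu] - allocated.get(gpu, 0.0) for gpu in capacity}


# Hypothetical inputs chosen so the arithmetic reproduces the list
# [-97.0, 3.0, -25.0, -24.0, 17.0, 31.0, 59, 59] reported above.
capacity = {f"gpu{i}": 59.0 for i in range(8)}
allocated = {"gpu0": 156.0, "gpu1": 56.0, "gpu2": 84.0,
             "gpu3": 83.0, "gpu4": 42.0, "gpu5": 28.0}

for gpu, rest in sorted(remaining_per_gpu(capacity, allocated).items()):
    print(f"{gpu}: remaining={rest}" + ("  <-- oversold" if rest < 0 else ""))
```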