
Curious about how to determine the pod container for Allocate RPC in gpu-manager

ryzzn opened this issue 4 years ago · 6 comments

Hi guys, I have just gone through the code of the Allocate function in gpu-manager, and I'm curious why the selected pod is guaranteed to be the right one for the allocation. The logic appears to be as follows (a rough sketch follows the list):

  1. List all pending pods that have a GPU requirement.
  2. Sort the pods by their predicate time.
  3. Find a pod with a container requesting the same number of GPU resources as the Allocate request.
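
A rough sketch of that selection flow as I understand it; every name below is made up for illustration and is not the actual gpu-manager code:

```go
// Illustrative sketch only; hypothetical types and helpers.
package sketch

import "sort"

type pod struct {
	UID           string
	PredicateTime uint64 // value of the predicate-time annotation
	GPUCount      int64  // GPUs requested by the container being allocated
}

// pickCandidate returns the oldest pending GPU pod whose container requests
// exactly the number of devices kubelet asked for in the Allocate RPC.
func pickCandidate(pending []pod, requested int64) *pod {
	// 1. `pending` is assumed to already contain only pending pods with a GPU requirement
	// 2. sort by the predicate-time annotation, oldest first
	sort.Slice(pending, func(i, j int) bool {
		return pending[i].PredicateTime < pending[j].PredicateTime
	})
	// 3. take the first pod whose request size matches the Allocate request
	for i := range pending {
		if pending[i].GPUCount == requested {
			return &pending[i]
		}
	}
	return nil
}
```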

In my mind, the predicate-time annotation can't guarantee that pods are bound to the node in the same order, since the binding process runs concurrently. Besides, kubelet presumably has its own order for allocating resources to containers (I'm not sure about this). So my doubt is why this approach reliably selects the corresponding pod.

Many thanks if I can get an answer.

ryzzn · Sep 24 '20

There's no guarantee, and gpu-manager will validate the allocation result.

mYmNeo · Nov 25 '20

@mYmNeo how does gpu-manager validate the result, in preStartContainer?

I checked the logic in preStartContainer: it gets the pod UID from the checkpoint and then gets vcores and vmems from the cache, but both the checkpoint and the cache are written by Allocate, so they can be mismatched as well.

zwpaper · Dec 29 '20

In preStartContainer, gpu-manager validates the assigned pod's data, including card-idx, vcores, and vmems, to identify the container. If any of card-idx, vcores, or vmems does not match, the assigned pod is rejected to keep consistency.
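
A minimal sketch of that check, with hypothetical names rather than the real gpu-manager structs:

```go
// Illustrative sketch only; not the actual gpu-manager code.
package sketch

import "fmt"

// allocatedInfo is what was recorded for the pod picked at Allocate time.
type allocatedInfo struct {
	CardIdx string // index of the physical card(s) the pod was bound to
	VCores  int64
	VMems   int64
}

// validateAssignment rejects the assigned pod if any recorded field disagrees
// with what the container actually requests at preStartContainer time.
func validateAssignment(recorded allocatedInfo, reqCardIdx string, reqVCores, reqVMems int64) error {
	if recorded.CardIdx != reqCardIdx || recorded.VCores != reqVCores || recorded.VMems != reqVMems {
		return fmt.Errorf("reject assignment: recorded %+v, requested cardIdx=%s vcores=%d vmems=%d",
			recorded, reqCardIdx, reqVCores, reqVMems)
	}
	return nil
}
```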

mYmNeo · Dec 29 '20

I'm still confused about this; let me try to explain:

This is the GPU Manager allocate and preStart-check logic (a sketch of the checkpoint round trip follows the list):

  1. Get a candidate pod here, chosen by predicate time: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L692
  2. Write the pod UID, container name, device IDs, vcore and vmem to the checkpoint: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L525
  3. vmem does nothing on allocate: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/server/vmemory.go#L86
  4. vmem does nothing on preStart: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/server/vmemory.go#L106
  5. vcore reads the checkpoint data here on preStart: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L794
  6. Check podUID, containerName, vcore and vmemory here: https://github.com/tkestack/gpu-manager/blob/master/pkg/services/allocator/nvidia/allocator.go#L833
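
To make the round trip concrete, here is a rough sketch of what step 2 writes and steps 5-6 read back; all names are invented and do not mirror the real code:

```go
// Illustrative sketch only; field and function names are hypothetical.
package sketch

// checkpointEntry is what Allocate records for the candidate pod it picked.
type checkpointEntry struct {
	PodUID        string
	ContainerName string
	DeviceIDs     []string
	VCores        int64
	VMems         int64
}

// checkpoint stands in for the on-disk checkpoint plus the in-memory cache,
// both of which are written only by Allocate.
var checkpoint = map[string]checkpointEntry{}

func allocate(candidate checkpointEntry) {
	checkpoint[candidate.PodUID] = candidate
}

// preStart re-reads the entry written by Allocate and compares vcores/vmems.
// Since the same code path wrote the data, a wrong pick with identical
// vcore/vmem numbers still passes this check.
func preStart(podUID string, reqVCores, reqVMems int64) bool {
	e, ok := checkpoint[podUID]
	return ok && e.VCores == reqVCores && e.VMems == reqVMems
}
```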

Let's assume kubelet sent a request for podA, but we picked podB by mistake for some reason. What's more, both podA and podB request the same number of vcores and vmems, so preStart cannot detect the mistake here.

Then this may happen:

  1. kubelet thinks podA has been allocated and tries to run it
  2. actually podA has not been allocated, so it fails
  3. kubelet retries the allocation for podA, but GPU Manager does not know it is a retry, so it allocates for the next pod
  4. podA fails again

I can't figure out how GPU Manager could possibly handle this. Maybe I've made a mistake somewhere; if so, please point it out, thanks.

zwpaper · Jan 05 '21

For gpu-manager, the allocation mechanism doesn't depend on the deviceID strings, only on the number of deviceIDs. So in your situation, pods with the same vcore and vmem resources can actually be treated as the same pod.
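
Roughly, the match works like this (illustrative names only, not the actual code):

```go
// Illustrative sketch: the match is on how many device IDs the Allocate
// request carries, not on which ID strings they are.
package sketch

func sameSizeMatch(requestedDeviceIDs []string, podGPURequest int) bool {
	return len(requestedDeviceIDs) == podGPURequest
}
```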

mYmNeo · Jan 06 '21

But actually, they are not.

If kubelet and gpu-manager choose a pod differently, say podA versus podB, then kubelet will fail at preStartContainer and retry the resource request for podA; this time gpu-manager allocates the resources for podA.

But then, when kubelet requests resources for podB, there will be no candidate pod for gpu-manager to find! podB will never manage to start.

zwpaper · Jan 19 '21