Curious about how to determine the pod container for Allocate RPC in gpu-manager
Hi guys, I have just gone through the code of the Allocate function in gpu-manager, and I'm curious why the selected pod is guaranteed to be the right one for the allocation. The logic seems to be as follows (a rough sketch of this selection follows the list):
- List all pending pods that have a GPU requirement.
- Sort the pods by their predicate time.
- Find a pod that has a container requesting the same number of GPU devices as the Allocate call.
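A minimal sketch of that selection flow as I understand it; the types and helper names (`pendingPod`, `pickCandidate`) are my own illustration, not gpu-manager's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// pendingPod is an illustrative stand-in for a pending pod that requested GPUs.
type pendingPod struct {
	UID           string
	PredicateTime uint64 // taken from the predicate-time annotation
	RequestedGPUs int    // number of GPU devices the container asks for
}

// pickCandidate sorts pending GPU pods by predicate time and returns the first
// one whose container requests the same number of devices as the Allocate call.
func pickCandidate(pending []pendingPod, numDevices int) *pendingPod {
	sort.Slice(pending, func(i, j int) bool {
		return pending[i].PredicateTime < pending[j].PredicateTime
	})
	for i := range pending {
		if pending[i].RequestedGPUs == numDevices {
			return &pending[i]
		}
	}
	return nil // no matching pending pod
}

func main() {
	pods := []pendingPod{
		{UID: "pod-b", PredicateTime: 20, RequestedGPUs: 1},
		{UID: "pod-a", PredicateTime: 10, RequestedGPUs: 1},
	}
	fmt.Println(pickCandidate(pods, 1).UID) // pod-a: earliest predicate time wins
}
```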
In my mind, the predicate-time annotation can't guarantee that pods are bound to the node in the same order, since the binding process runs concurrently. Besides, kubelet should have its own order for allocating resources to containers (I'm not sure about that). So my doubt is: why is your solution correct in selecting the corresponding pod?
Many thanks if I can get the answer.
There's no guarantee, and gpu-manager will validate the allocation result.
@mYmNeo how does gpu-manager validate the result, in `preStartContainer`?
I checked the logic in `preStartContainer`: it gets the pod UID from the checkpoint, and then gets the vcores and vmems from the cache, but both the checkpoint and the cache are written by `Allocate`, so they can also be mismatched.
In `preStartContainer`, gpu-manager validates the assigned pod's data, including card-idx, vcores and vmems, to identify the container. If any of card-idx, vcores or vmems does not match, the assigned pod is rejected to keep consistency.
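Conceptually, that consistency check amounts to something like the sketch below; the `assignment` struct and its fields are illustrative assumptions, not gpu-manager's real types:

```go
package main

import "fmt"

// assignment is an illustrative record of what was decided for one container.
type assignment struct {
	PodUID        string
	ContainerName string
	CardIdx       []int // assigned physical card indexes
	Vcores        int64
	Vmems         int64
}

// consistent reports whether the data restored from the checkpoint matches the
// allocator's cached decision; any mismatch means the assigned pod is rejected.
func consistent(cached, restored assignment) bool {
	if len(cached.CardIdx) != len(restored.CardIdx) {
		return false
	}
	for i := range cached.CardIdx {
		if cached.CardIdx[i] != restored.CardIdx[i] {
			return false
		}
	}
	return cached.Vcores == restored.Vcores && cached.Vmems == restored.Vmems
}

func main() {
	cached := assignment{PodUID: "uid-1", ContainerName: "cuda", CardIdx: []int{0}, Vcores: 50, Vmems: 4096}
	restored := assignment{PodUID: "uid-1", ContainerName: "cuda", CardIdx: []int{1}, Vcores: 50, Vmems: 4096}
	fmt.Println(consistent(cached, restored)) // false: card-idx differs, so the pod is rejected
}
```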
I'm still confused about this; let me try to explain myself.
This is GPU Manager's allocate and preStart check logic (a rough sketch of the checkpoint round trip follows the list):
- get a candidate pod here, chosen by predicate time: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L692
- write the `pod UID`, `container name`, `Devices IDs`, `vcore` and `vmem` to the checkpoint: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L525
- `vmem` does nothing on allocate: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/server/vmemory.go#L86
- `vmem` does nothing at preStart: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/server/vmemory.go#L106
- `vcore` gets the checkpoint data here at preStart: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L794
- check podUID, containerName, vcore and vmemory here: https://github.com/tkestack/gpu-manager/blob/master/pkg/services/allocator/nvidia/allocator.go#L833
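To make the steps above concrete, here is a rough sketch of the checkpoint round trip I have in mind: Allocate writes a record keyed by pod UID and container name, and the preStart check reads it back. The struct and field names are my own assumptions, not gpu-manager's actual checkpoint format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerRecord is an illustrative checkpoint entry written during Allocate.
type containerRecord struct {
	PodUID        string   `json:"podUID"`
	ContainerName string   `json:"containerName"`
	DeviceIDs     []string `json:"deviceIDs"`
	Vcore         int64    `json:"vcore"`
	Vmem          int64    `json:"vmem"`
}

func main() {
	// Allocate: persist the decision for the chosen candidate pod.
	written := containerRecord{
		PodUID:        "pod-a-uid",
		ContainerName: "cuda",
		DeviceIDs:     []string{"/dev/nvidia0"},
		Vcore:         50,
		Vmem:          4096,
	}
	blob, _ := json.Marshal(written)

	// preStart: restore the record and compare it against the cached data.
	var restored containerRecord
	if err := json.Unmarshal(blob, &restored); err != nil {
		panic(err)
	}
	fmt.Printf("restored checkpoint for %s/%s: %+v\n", restored.PodUID, restored.ContainerName, restored)
}
```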
Let's assume kubelet sent a request for `podA`, but we picked `podB` by mistake for some reason. What's more, both `podA` and `podB` request the same `vcore` and `vmem`, so `preStart` cannot detect the mistake here.
Then this may happen:
- kubelet thinks `podA` has been allocated and tries to run it
- actually `podA` has not been allocated, so it fails
- kubelet retries the `podA` allocation, but `GPU Manager` does not know about the retry and allocates for the next one
- `podA` fails again
I cannot figure out how on earth GPU Manager could handle this. Maybe I have made some mistake; please point it out, thanks.
For gpu-manager, the allocation mechanism doesn't depend on the deviceID strings, only on the number of deviceIDs. So in your situation, pods that request the same `vcore` and `vmem` resources can actually be treated as the same pod.
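In other words, two requests are indistinguishable to the allocator when they ask for the same number of devices and the same vcore/vmem amounts. The sketch below is my own illustration of that point, not gpu-manager's code:

```go
package main

import "fmt"

// request captures the only attributes the matching logic looks at.
type request struct {
	DeviceCount int
	Vcore       int64
	Vmem        int64
}

// sameShape reports whether two allocation requests look identical to the
// allocator, regardless of which deviceID strings kubelet actually passed.
func sameShape(a, b request) bool {
	return a.DeviceCount == b.DeviceCount && a.Vcore == b.Vcore && a.Vmem == b.Vmem
}

func main() {
	podA := request{DeviceCount: 1, Vcore: 50, Vmem: 4096}
	podB := request{DeviceCount: 1, Vcore: 50, Vmem: 4096}
	fmt.Println(sameShape(podA, podB)) // true: the two pods cannot be told apart
}
```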
But actually, they are not.
If kubelet and gpu-manager chose pods differently, `podA` and `podB` for example, then kubelet will fail at `preStartContainer`,
and kubelet will retry requesting the resources for `podA`; this time gpu-manager allocates the resources for `podA`.
But then, when kubelet requests resources for `podB`, there will be no candidate pod for gpu-manager to find! `podB` will never manage to start.