                        Curious about how to determine the pod container for Allocate RPC in gpu-manager
Hi guys, I have just gone through the code of the Allocate function in gpu-manager, and I'm curious why the selected pod is the right one to allocate for. The logic seems to be as follows (a rough sketch in code follows the list):
- List all pending pods that have a GPU requirement.
- Sort the pods by their predicate time.
- Find a pod with a container that requests the same number of GPU resources.
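To make sure I'm reading it correctly, here is a minimal sketch of how I understand that selection, with made-up types and helper names (the real code works on pod objects and reads the predicate time from a pod annotation):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// pendingPod is a stand-in for the pod data gpu-manager inspects; the real
// code parses the predicate timestamp from a pod annotation.
type pendingPod struct {
	UID           string
	PredicateTime time.Time // parsed from the predicate-time annotation
	VcoreRequest  int       // requested vcore devices of the container being allocated
}

// pickCandidate mimics the selection as I understand it: sort pending GPU pods
// by predicate time, then return the first pod whose container requests the
// same number of devices as the incoming Allocate request.
func pickCandidate(pending []pendingPod, requestedDevices int) (pendingPod, bool) {
	sort.Slice(pending, func(i, j int) bool {
		return pending[i].PredicateTime.Before(pending[j].PredicateTime)
	})
	for _, p := range pending {
		if p.VcoreRequest == requestedDevices {
			return p, true
		}
	}
	return pendingPod{}, false
}

func main() {
	now := time.Now()
	pods := []pendingPod{
		{UID: "podB", PredicateTime: now.Add(2 * time.Second), VcoreRequest: 50},
		{UID: "podA", PredicateTime: now.Add(1 * time.Second), VcoreRequest: 50},
	}
	// An Allocate request for 50 vcore devices picks podA purely because its
	// predicate time is earlier -- this ordering assumption is what my
	// question is about.
	if p, ok := pickCandidate(pods, 50); ok {
		fmt.Println("selected:", p.UID)
	}
}
```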
In my mind, the predicate-time annotation can't guarantee the same order in which pods are bound to the node, since the binding process runs concurrently. Besides, kubelet should have its own order for allocating resources to containers (I'm not sure about that). So my doubt is why this solution is correct for selecting the corresponding pod.
Many thanks if I can get the answer.
There's no guarantee, and gpu-manager will validate the allocation result.
@mYmNeo how does gpu-manager validate the result, in preStartContainer?
I checked the logic in preStartContainer: it gets the pod UID from the checkpoint, and then gets vcores and vmems from the cache, but both the checkpoint and the cache are written by Allocate, so they can also be mismatched.
In preStartContainer, gpu-manager will validate the assigned pod's data, including card-idx, vcores, and vmems, to identify the container. If any of card-idx, vcore, or vmems does not match, the assigned pod is rejected to keep consistency.
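If I follow, the check is conceptually something like the sketch below. The types and the `validate` helper are mine, not gpu-manager's; the real code compares the data recovered in preStartContainer against what was recorded at Allocate time:

```go
package main

import (
	"fmt"
	"reflect"
)

// allocation is a stand-in for the per-container data recorded at Allocate
// time and recovered again at preStartContainer.
type allocation struct {
	CardIdx []int // assigned card indices
	Vcore   int64
	Vmem    int64
}

// validate mirrors the consistency rule described above: if any of card-idx,
// vcore or vmem differs between the recorded assignment and what preStart
// sees, the assignment is rejected.
func validate(recorded, seen allocation) error {
	if !reflect.DeepEqual(recorded.CardIdx, seen.CardIdx) ||
		recorded.Vcore != seen.Vcore ||
		recorded.Vmem != seen.Vmem {
		return fmt.Errorf("allocation mismatch: recorded %+v vs seen %+v", recorded, seen)
	}
	return nil
}

func main() {
	recorded := allocation{CardIdx: []int{0}, Vcore: 50, Vmem: 4096}
	seen := allocation{CardIdx: []int{0}, Vcore: 50, Vmem: 4096}
	fmt.Println("validate:", validate(recorded, seen)) // <nil>: accepted
}
```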
I'm still confused about this, so let me try to explain myself:
This is GPU Manager's allocate and preStart-check logic (a rough sketch of the checkpoint record follows the list):
- get a candidate pod here, which is chosen by predicate time: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L692
- write the pod UID, container name, device IDs, vcore and vmem to the checkpoint: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L525
- vmem does nothing during allocate: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/server/vmemory.go#L86
- vmem does nothing during preStart: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/server/vmemory.go#L106
- vcore gets the checkpoint data here during preStart: https://github.com/tkestack/gpu-manager/blob/15b913864e4d24a5a5180da3aa3875acac70801c/pkg/services/allocator/nvidia/allocator.go#L794
- check podUID, containerName, vcore, vmemory here: https://github.com/tkestack/gpu-manager/blob/master/pkg/services/allocator/nvidia/allocator.go#L833
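Putting the steps above together, the per-container record that gets written at Allocate and read back in preStartContainer carries roughly this shape (field names and values are illustrative, not the actual gpu-manager/kubelet structs):

```go
package main

import "fmt"

// checkpointEntry approximates the per-container record that, per the
// walkthrough above, is written at Allocate and read back in preStartContainer.
type checkpointEntry struct {
	PodUID        string   // pod the allocation was made for
	ContainerName string   // container within that pod
	DeviceIDs     []string // assigned vcore device IDs
	Vcore         int64    // requested vcore units
	Vmem          int64    // requested vmemory; the vmem hooks themselves are no-ops
}

func main() {
	e := checkpointEntry{
		PodUID:        "1234-abcd",       // placeholder UID
		ContainerName: "cuda-job",        // placeholder container name
		DeviceIDs:     []string{"vcore-dev-0"},
		Vcore:         50,
		Vmem:          4096,
	}
	fmt.Printf("%+v\n", e)
}
```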
let's assume that kubelet sent a request for podA, but we picked podB by mistake for some reason; what's more, both podA and podB request the same vcore and vmem, so preStart cannot detect the mistake here.
then this may happen:
- kubelet thinks podA has been allocated and tries to run it
- actually podA has not been allocated, so it fails
- kubelet retries the podA allocation, but GPU Manager does not know about the retry, so it allocates for the next pod
- podA fails again
I can't figure out how on earth GPU Manager handles this; maybe I've made a mistake somewhere, please point it out, thanks
For gpu-manager, the allocation mechanism doesn't depend on the deviceID strings, only on the number of device IDs. So in your situation, pods that request the same vcore and vmem resources can actually be treated as the same pod.
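In other words, if the matching really only looks at the size of the device-ID list and the vcore/vmem amounts, a toy predicate like the one below (hypothetical names, not gpu-manager's code) cannot tell the two pods apart:

```go
package main

import "fmt"

// request is a stand-in for what the plugin sees from an Allocate call:
// only the number of requested device IDs plus the vcore/vmem amounts.
type request struct {
	NumDeviceIDs int
	Vcore, Vmem  int64
}

// sameShape compares only sizes and amounts, never which pod the request
// actually came from.
func sameShape(a, b request) bool {
	return a.NumDeviceIDs == b.NumDeviceIDs && a.Vcore == b.Vcore && a.Vmem == b.Vmem
}

func main() {
	podA := request{NumDeviceIDs: 1, Vcore: 50, Vmem: 4096}
	podB := request{NumDeviceIDs: 1, Vcore: 50, Vmem: 4096}
	// Both pods look identical to this kind of check, which is the sense in
	// which they are "treated as the same pod".
	fmt.Println(sameShape(podA, podB)) // true
}
```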
but actually, they are not.
if kubelet and gpu-manager choose different pods, say podA and podB, then kubelet will fail at preStartContainer,
and kubelet will retry requesting the resources for podA; this time gpu-manager allocates the resources for podA.
but then, when kubelet requests resources for podB, there will be no candidate pod for gpu-manager to find! podB will never manage to start.