The active pods considered by gpu-admission and gpu-manager are inconsistent

Open fighterhit opened this issue 4 years ago • 0 comments

Hi @mYmNeo, I sometimes find that when a GPU pod is created while other GPU pods are being deleted or are terminating, UnexpectedAdmissionError appears somewhat more often. I observed that gpu-admission and gpu-manager use different logic to get the active GPU pods on a node. When gpu-admission gets the active pods, it seems to treat pods that are being deleted as still occupying their GPUs, whereas gpu-manager excludes these pods. The two components can therefore disagree about which GPUs are free and select different ones for the same pod. I think their logic for getting active pods should be made consistent, to reduce the occurrence of UnexpectedAdmissionError caused by inconsistent GPU selection.

gpu-admission: https://github.com/tkestack/gpu-admission/blob/47d56ae99ef7f24f2c9c4d33d17567e2e52f3ba2/pkg/predicate/gpu_predicate.go#L213-L215
gpu-manager: https://github.com/tkestack/gpu-manager/blob/c961e77c3e65ef68299d0ba8ccb945b063896a03/pkg/services/watchdog/watchdog.go#L137-L138

Dec 29 '21 12:12 fighterhit