Improve performance under large-scale pod allocation

Open JasonHe-WQ opened this issue 1 year ago • 0 comments

1. Issue or feature description

Users reported that the HAMi scheduler suffers performance degradation under large-scale allocation. They further found that the functions LockNode and ReleaseNodeLock cause too many retries. A possible solution is to change the lock granularity from the node to the GPU UUID.

e.g. use a map[GpuUUID]Lock instead of LockNode

https://github.com/Project-HAMi/HAMi/blob/8b5e5b88e75a68019c46a2caaa05e1995744a13d/pkg/device/nvidia/device.go#L88
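To make the map[GpuUUID]Lock idea concrete, here is a minimal in-process sketch, not a proposed patch. The names gpuLocks, lockFor, withGPU and simulateAllocations are hypothetical and do not exist in HAMi. The sketch also does not model how LockNode and ReleaseNodeLock are actually implemented; it only illustrates that allocations targeting different GPUs no longer contend for one shared lock.

```go
package main

import (
	"fmt"
	"sync"
)

// gpuLocks is a hypothetical registry holding one mutex per GPU UUID.
// It stands in for the map[GpuUUID]Lock suggested above.
type gpuLocks struct {
	mu    sync.Mutex             // guards the map itself, held only briefly
	locks map[string]*sync.Mutex // GPU UUID -> lock for that GPU
}

func newGPULocks() *gpuLocks {
	return &gpuLocks{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for uuid, creating it on first use.
func (g *gpuLocks) lockFor(uuid string) *sync.Mutex {
	g.mu.Lock()
	defer g.mu.Unlock()
	l, ok := g.locks[uuid]
	if !ok {
		l = &sync.Mutex{}
		g.locks[uuid] = l
	}
	return l
}

// withGPU runs fn while holding only the lock of the given GPU,
// so work on other GPUs of the same node is not blocked.
func (g *gpuLocks) withGPU(uuid string, fn func()) {
	l := g.lockFor(uuid)
	l.Lock()
	defer l.Unlock()
	fn()
}

// simulateAllocations starts podsPerGPU goroutines for every GPU. Each
// goroutine increments that GPU's counter under the per-GPU lock. It
// returns the number of simulated allocations recorded per UUID.
func simulateAllocations(uuids []string, podsPerGPU int) map[string]int {
	locks := newGPULocks()
	counters := make([]int, len(uuids)) // one slot per GPU
	var wg sync.WaitGroup
	for i, uuid := range uuids {
		for p := 0; p < podsPerGPU; p++ {
			wg.Add(1)
			go func(i int, uuid string) {
				defer wg.Done()
				locks.withGPU(uuid, func() { counters[i]++ })
			}(i, uuid)
		}
	}
	wg.Wait()
	result := make(map[string]int, len(uuids))
	for i, uuid := range uuids {
		result[uuid] = counters[i]
	}
	return result
}

func main() {
	// Placeholder UUIDs; 200 simulated pods per GPU.
	got := simulateAllocations([]string{"GPU-a", "GPU-b", "GPU-c"}, 200)
	fmt.Println(got) // prints map[GPU-a:200 GPU-b:200 GPU-c:200]
}
```

A pod that requests several GPUs would need to take the per-GPU locks in a fixed order (for example sorted by UUID) to avoid deadlock, and a real fix would have to work across scheduler replicas, which an in-memory map alone does not cover.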

2. Steps to reproduce the issue

Allocate hundreds of pods at one time.

3. Information to attach (optional if deemed irrelevant)

User feedback in the WeChat group of Project Volcano

Jul 05 '24 02:07 JasonHe-WQ