Scheduler does not work because checkHierarchicalQueue fails when a GPU is dropped
Description
Currently, we use hierarchical queues to achieve resource isolation in multi-tenant scenarios. We recently encountered a scenario that can lead to significant stability issues. For example, we have the following hierarchical queue:
root (capacity: 32 H100 = cluster total, 4 nodes with 8 cards each)
├── child-queue-a (capacity: 32)
└── child-queue-b (capacity: 10)
└── subchild-queue-a (capacity: 5)
....
The quotas for the child and subchild queues are declared by tenants based on the total capacity of the resource pool.
At first, the whole system works well. At a certain point, one of the physical nodes experienced a GPU card failure, a fairly common fault scenario. This caused the GPU resources reported by that node to decrease from 8 to 7. Because the root queue's capacity is calculated from node capacity, the root queue's cap decreased from 32 to 31. Due to the validation in checkHierarchicalQueue, the capacity plugin blocks when this tree invariant is broken, which may prevent all tenants' computing workloads from being scheduled. From a business perspective, even if a GPU card failure occurs, child-queue-b and subchild-queue-a should still work normally, because the entire resource pool still has sufficient GPU resources.
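To make the failure mode concrete, here is a minimal sketch of the validation as I understand it. The types and the check below are illustrative stand-ins, not Volcano's actual `checkHierarchicalQueue` implementation: the assumption is that each child's declared capability must fit inside its parent's real capability, and the root's real capability is derived from node reports.

```go
package main

import "fmt"

// Queue is a simplified stand-in for the capacity plugin's queue attributes.
type Queue struct {
	Name       string
	Capability int // GPUs declared by the tenant
	Children   []*Queue
}

// checkHierarchicalQueue mimics the assumed invariant: every child's declared
// capability must not exceed the parent's real capability. For the root,
// realCapability comes from node resource reports.
func checkHierarchicalQueue(q *Queue, realCapability int) error {
	for _, c := range q.Children {
		if c.Capability > realCapability {
			return fmt.Errorf("queue %s capability %d exceeds parent %s real capability %d",
				c.Name, c.Capability, q.Name, realCapability)
		}
		if err := checkHierarchicalQueue(c, c.Capability); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := &Queue{Name: "root", Children: []*Queue{
		{Name: "child-queue-a", Capability: 32},
		{Name: "child-queue-b", Capability: 10, Children: []*Queue{
			{Name: "subchild-queue-a", Capability: 5},
		}},
	}}
	fmt.Println(checkHierarchicalQueue(root, 32)) // healthy: 4 nodes * 8 GPUs
	fmt.Println(checkHierarchicalQueue(root, 31)) // one GPU dropped: child-queue-a's 32 no longer fits
}
```

With 32 GPUs the whole tree validates; after a single card failure, the node-derived root value shrinks to 31 and child-queue-a's declaration of 32 fails the check, blocking the plugin for every tenant.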
This may have two possible solutions:
- Currently, the child and subchild queues are declared by the business itself, so they can be maintained independently by the business's controller to keep their tree structure stable. The root queue is different: it is constrained by the node resource-reporting chain, including components such as physical devices and device plugins. One option is to handle the root queue specially, for example by skipping the root queue check in checkHierarchicalQueue and several other functions, but I don't think that is a good solution.
- The core issue is that we currently allow users to set the capacity for the root queue themselves, but in checkHierarchicalQueue we still use realCapacity, which in turn comes from the sum of the actual node resources. I hope to add an option to use the configured cap directly as realCapacity in the calculations.
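The second option could be sketched as below. All names here are hypothetical (no such flag exists in Volcano today): a plugin option decides whether the tenant-configured root capability overrides the node-derived realCapability before the hierarchy check runs.

```go
package main

import "fmt"

// effectiveRootCapability is a hypothetical helper illustrating the proposed
// option: when useConfigured is enabled and the tenant has set a root
// capability, that value replaces the node-derived one, so a single failed
// GPU no longer shrinks the root below its children's declarations.
func effectiveRootCapability(configured, nodeDerived int, useConfigured bool) int {
	// Fall back to the node-derived value when the option is off or the
	// configured capability was never set (treated here as 0).
	if useConfigured && configured > 0 {
		return configured
	}
	return nodeDerived
}

func main() {
	// A GPU failure drops the node-derived value from 32 to 31, but the
	// configured capability (32) keeps the hierarchy check stable.
	fmt.Println(effectiveRootCapability(32, 31, true))  // option on: configured cap wins
	fmt.Println(effectiveRootCapability(32, 31, false)) // option off: today's behavior
}
```

The trade-off of this design is that the configured cap may temporarily exceed the physically available resources; the scheduler would still be unable to place pods on the failed card, but queue validation and scheduling for the healthy part of the pool would keep working.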
Steps to reproduce the issue
Describe the results you received and expected
Tasks can be enqueued normally when resources are sufficient.
What version of Volcano are you using?
master
Any other relevant information
No response
cc @Monokaix @JesseStutler for opinion
This PR would solve the issue too: https://github.com/volcano-sh/volcano/pull/4662
This PR would solve the issue too: #4662. Yes, I think this issue can be resolved by adding an option to choose whether to use capability to override realCapability; the capability can be manually configured. The scenario in that PR originated from autoscaling, but the solution is the same.
@Poor12
Does https://github.com/volcano-sh/volcano/pull/4662 only solve the capability problem? The deserved value has the same issue and needs to be handled as well.