Scheduler does not work because checkHierarchicalQueue fails when a GPU is dropped
Description
Currently, we use hierarchical queues to achieve resource isolation in multi-tenant scenarios. We recently encountered a scenario that can lead to significant stability issues. For example, we have the following hierarchical queue:
root (capacity: 32 H100 = cluster total, 4 nodes with 8 cards each)
├── child-queue-a (capacity: 32)
└── child-queue-b (capacity: 10)
└── subchild-queue-a (capacity: 5)
....
The quotas for the child and subchild queues are declared by tenants based on the total capacity of the resource pool.
At first, the whole system works well. At a certain point, one of the physical nodes experienced a GPU card failure, a fairly common fault scenario. This caused the GPU resources reported by that node to decrease from 8 to 7. Because the root queue's capacity is calculated from node capacity, the root queue's cap decreased from 32 to 31. Due to the validation in checkHierarchicalQueue, the capacity plugin blocks when this tree invariant is broken, which may prevent all tenants' computing workloads from being scheduled. From a business perspective, even if a GPU card failure occurs, child-queue-b and subchild-queue-a should still work normally, because the entire resource pool still has sufficient GPU resources.
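To make the failure mode concrete, here is a minimal sketch of the validation as I understand it. The types and the check below are illustrative stand-ins, not Volcano's actual `checkHierarchicalQueue` implementation: the assumption is that each child's declared capability must fit inside its parent's real capability, and the root's real capability is derived from node reports.

```go
package main

import "fmt"

// Queue is a simplified stand-in for the capacity plugin's queue attributes.
type Queue struct {
	Name       string
	Capability int // GPUs declared by the tenant
	Children   []*Queue
}

// checkHierarchicalQueue mimics the assumed invariant: every child's declared
// capability must not exceed the parent's real capability. For the root,
// realCapability comes from node resource reports.
func checkHierarchicalQueue(q *Queue, realCapability int) error {
	for _, c := range q.Children {
		if c.Capability > realCapability {
			return fmt.Errorf("queue %s capability %d exceeds parent %s real capability %d",
				c.Name, c.Capability, q.Name, realCapability)
		}
		if err := checkHierarchicalQueue(c, c.Capability); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := &Queue{Name: "root", Children: []*Queue{
		{Name: "child-queue-a", Capability: 32},
		{Name: "child-queue-b", Capability: 10, Children: []*Queue{
			{Name: "subchild-queue-a", Capability: 5},
		}},
	}}
	fmt.Println(checkHierarchicalQueue(root, 32)) // healthy: 4 nodes * 8 GPUs
	fmt.Println(checkHierarchicalQueue(root, 31)) // one GPU dropped: child-queue-a's 32 no longer fits
}
```

With 32 GPUs the whole tree validates; after a single card failure, the node-derived root value shrinks to 31 and child-queue-a's declaration of 32 fails the check, blocking the plugin for every tenant.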
This may have two possible solutions:
- Currently, the child and subchild queues are declared by the business itself, so they can be maintained independently by the business's controller to keep their tree structure stable. The root queue is different: it is constrained by the node resource-reporting chain, including components such as physical devices and device plugins. One option is to handle the root queue specially, for example by skipping the root queue check in checkHierarchicalQueue and several other functions, but I don't think that is a good solution.
- The core issue is that we currently allow users to set the capacity for the root queue themselves, but in checkHierarchicalQueue we still use realCapacity, which in turn comes from the sum of the actual node resources. I hope to add an option to use the configured cap directly as realCapacity in the calculations.
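The second option could be sketched as below. All names here are hypothetical (no such flag exists in Volcano today): a plugin option decides whether the tenant-configured root capability overrides the node-derived realCapability before the hierarchy check runs.

```go
package main

import "fmt"

// effectiveRootCapability is a hypothetical helper illustrating the proposed
// option: when useConfigured is enabled and the tenant has set a root
// capability, that value replaces the node-derived one, so a single failed
// GPU no longer shrinks the root below its children's declarations.
func effectiveRootCapability(configured, nodeDerived int, useConfigured bool) int {
	// Fall back to the node-derived value when the option is off or the
	// configured capability was never set (treated here as 0).
	if useConfigured && configured > 0 {
		return configured
	}
	return nodeDerived
}

func main() {
	// A GPU failure drops the node-derived value from 32 to 31, but the
	// configured capability (32) keeps the hierarchy check stable.
	fmt.Println(effectiveRootCapability(32, 31, true))  // option on: configured cap wins
	fmt.Println(effectiveRootCapability(32, 31, false)) // option off: today's behavior
}
```

The trade-off of this design is that the configured cap may temporarily exceed the physically available resources; the scheduler would still be unable to place pods on the failed card, but queue validation and scheduling for the healthy part of the pool would keep working.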
Steps to reproduce the issue
Describe the results you received and expected
Tasks can be enqueued normally when resources are sufficient.
What version of Volcano are you using?
master
Any other relevant information
No response
cc @Monokaix @JesseStutler for opinion
This PR would solve the issue too: https://github.com/volcano-sh/volcano/pull/4662
This PR would solve the issue too: #4662. Yes, I think this issue can be resolved by adding an option to choose whether to use capability to override realCapability; the capability can be manually configured. The scenario in that PR originated from autoscaling, but the solution is the same.
@Poor12
Does https://github.com/volcano-sh/volcano/pull/4662 only solve the capability problem? The deserved value has the same issue and needs to be handled as well.