
Problem with the deserved resource calculation logic for queues in the proportion plugin (v1.4)

Open yalbaba opened this issue 1 year ago • 6 comments

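Current problem: a pod that requests resources from a queue can fail to be scheduled even though the queue has enough resources.

Case:

Cluster resources: cpu: 2, memory: 2, ScalarResources: "nvidia/gpu": 8
q1 capability: cpu: 0, memory: 0, ScalarResources: "nvidia/gpu": 0, weight: 1
q2 capability: cpu: 1, memory: 1, ScalarResources: "nvidia/gpu": 1, weight: 1
q3 capability: cpu: 2, memory: 2, ScalarResources: "nvidia/gpu": 7, weight: 1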

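Suppose three pods each request resources from one of the three queues:

pod1 => q1, request: cpu: 0, memory: 0, ScalarResources: "nvidia/gpu": 0
pod2 => q2, request: cpu: 1, memory: 1, ScalarResources: "nvidia/gpu": 1
pod3 => q3, request: cpu: 2, memory: 2, ScalarResources: "nvidia/gpu": 7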

In the first allocation round, q1's deserved is cpu: 0.66, memory: 0.66, ScalarResources: "nvidia/gpu": 2.66. The function fragment where the problem occurs is:

```go
// problematic code
if attr.capability != nil && !attr.deserved.LessEqualStrict(attr.capability) {
	attr.deserved = helpers.Min(attr.deserved, attr.capability)
	attr.deserved = helpers.Min(attr.deserved, attr.request)
	meet[attr.queueID] = struct{}{}
	klog.V(4).Infof("queue <%s> is meet cause of the capability", attr.name)
}
```

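The LessEqualStrict function it calls is:

```go
// problematic code
func (r *Resource) LessEqualStrict(rr *Resource) bool {
	lessFunc := func(l, r float64) bool {
		return l <= r
	}

	if !lessFunc(r.MilliCPU, rr.MilliCPU) {
		return false
	}
	if !lessFunc(r.Memory, rr.Memory) {
		return false
	}

	// A single scalar resource above rr makes the whole comparison fail.
	for rName, rQuant := range r.ScalarResources {
		if !lessFunc(rQuant, rr.ScalarResources[rName]) {
			return false
		}
	}

	return true
}
```

Because only ScalarResources exceeds the capability, q1 exits the allocation and does not take part in the next round, so q1 actually receives cpu: 0.66, memory: 0.66, gpu: 2.66. As a result, when the allocate phase begins, pod2 cannot start because of insufficient resources.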
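The first-round arithmetic in this report can be replayed with a small standalone sketch. It is not Volcano code: `Resource`, `lessEqualStrict`, `minResource`, `firstRoundShare`, and `capDeserved` are hypothetical, simplified stand-ins that only model the weight-proportional split and the capability check quoted above, and units are the plain numbers used in this issue.

```go
package main

import (
	"fmt"
	"math"
)

// Resource is a simplified, hypothetical stand-in for the scheduler's resource type.
type Resource struct {
	CPU, Memory float64
	Scalars     map[string]float64
}

// lessEqualStrict mirrors the quoted check: true only if every dimension of r is <= rr.
func lessEqualStrict(r, rr Resource) bool {
	if r.CPU > rr.CPU || r.Memory > rr.Memory {
		return false
	}
	for name, q := range r.Scalars {
		if q > rr.Scalars[name] {
			return false
		}
	}
	return true
}

// minResource returns the per-dimension minimum of a and b.
func minResource(a, b Resource) Resource {
	out := Resource{
		CPU:     math.Min(a.CPU, b.CPU),
		Memory:  math.Min(a.Memory, b.Memory),
		Scalars: map[string]float64{},
	}
	for name, q := range a.Scalars {
		out.Scalars[name] = math.Min(q, b.Scalars[name])
	}
	return out
}

// firstRoundShare splits total by weight/totalWeight in every dimension.
func firstRoundShare(total Resource, weight, totalWeight float64) Resource {
	out := Resource{
		CPU:     total.CPU * weight / totalWeight,
		Memory:  total.Memory * weight / totalWeight,
		Scalars: map[string]float64{},
	}
	for name, q := range total.Scalars {
		out.Scalars[name] = q * weight / totalWeight
	}
	return out
}

// capDeserved applies the quoted branch: if any dimension exceeds capability,
// clamp deserved to capability and request and report the queue as "met".
func capDeserved(deserved, capability, request Resource) (Resource, bool) {
	if !lessEqualStrict(deserved, capability) {
		return minResource(minResource(deserved, capability), request), true
	}
	return deserved, false
}

func main() {
	gpu := "nvidia/gpu"
	total := Resource{CPU: 2, Memory: 2, Scalars: map[string]float64{gpu: 8}}
	// In this case each pod requests exactly its queue's capability.
	caps := []Resource{
		{CPU: 0, Memory: 0, Scalars: map[string]float64{gpu: 0}},
		{CPU: 1, Memory: 1, Scalars: map[string]float64{gpu: 1}},
		{CPU: 2, Memory: 2, Scalars: map[string]float64{gpu: 7}},
	}
	for i, c := range caps {
		d, met := capDeserved(firstRoundShare(total, 1, 3), c, c)
		fmt.Printf("q%d deserved cpu=%.2f mem=%.2f gpu=%.2f met=%v\n",
			i+1, d.CPU, d.Memory, d.Scalars[gpu], met)
	}
}
```

In this simplified model q2 is marked as met after the first round with a deserved CPU of about 0.67 (written as 0.66 in the report), which is below the cpu: 1 that pod2 requests.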

yalbaba avatar Jun 21 '24 02:06 yalbaba

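I think the comparison used here to decide whether deserved exceeds the queue's capability is incomplete: deserved > capability should hold only when every resource dimension exceeds the capability.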
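A minimal sketch of the comparison suggested in this comment, assuming a simplified `Resource` type and a hypothetical helper `exceedsAll`; neither is part of Volcano, and this is an illustration of the proposal rather than a tested fix. It treats deserved as exceeding capability only when every dimension is strictly greater.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for the scheduler's resource type.
type Resource struct {
	CPU, Memory float64
	Scalars     map[string]float64
}

// exceedsAll reports whether every dimension of deserved is strictly
// greater than the same dimension of capability.
func exceedsAll(deserved, capability Resource) bool {
	if deserved.CPU <= capability.CPU || deserved.Memory <= capability.Memory {
		return false
	}
	for name, q := range deserved.Scalars {
		if q <= capability.Scalars[name] {
			return false
		}
	}
	return true
}

func main() {
	gpu := "nvidia/gpu"
	// First-round share from the case in this issue: one third of cpu 2, memory 2, gpu 8.
	deserved := Resource{CPU: 2.0 / 3, Memory: 2.0 / 3, Scalars: map[string]float64{gpu: 8.0 / 3}}

	q1 := Resource{CPU: 0, Memory: 0, Scalars: map[string]float64{gpu: 0}}
	q2 := Resource{CPU: 1, Memory: 1, Scalars: map[string]float64{gpu: 1}}
	q3 := Resource{CPU: 2, Memory: 2, Scalars: map[string]float64{gpu: 7}}

	fmt.Println("q1", exceedsAll(deserved, q1)) // q1 true
	fmt.Println("q2", exceedsAll(deserved, q2)) // q2 false
	fmt.Println("q3", exceedsAll(deserved, q3)) // q3 false
}
```

Under this check q2 would not be treated as over capability in the first round, since only its GPU share exceeds the limit. A real change would presumably still have to clamp the individual dimensions that are over capability, which this sketch does not attempt.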

yalbaba avatar Jun 21 '24 02:06 yalbaba

have you tried v1.9.0?

Monokaix avatar Jun 21 '24 06:06 Monokaix

have you tried v1.9.0?

Not yet; upgrading Volcano may have an impact on our project.

yalbaba avatar Jun 24 '24 07:06 yalbaba

have you tried v1.9.0?

I wonder why this code was designed this way.

yalbaba avatar Jun 24 '24 07:06 yalbaba

Could someone please explain?

yalbaba avatar Jun 28 '24 15:06 yalbaba

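This original code is the DRF implementation logic; for reference, see the paper "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types". The calculation is similar to the one described here: https://koordinator.sh/zh-Hans/docs/designs/multi-hierarchy-elastic-quota-management.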

lowang-bh avatar Jun 29 '24 00:06 lowang-bh

Hello 👋 Looks like there was no activity on this issue for last 180 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 90 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Apr 25 '25 23:04 stale[bot]