Problem with the deserved resource calculation logic for queues in the proportion plugin (v1.4)
Current problem: a pod submitting a resource request to a queue can fail to schedule even though the queue has sufficient resources.

Case:
- Cluster resources: cpu:2, memory:2, ScalarResources "nvidia/gpu":8
- q1 capability: cpu:0, memory:0, ScalarResources "nvidia/gpu":0, weight:1
- q2 capability: cpu:1, memory:1, ScalarResources "nvidia/gpu":1, weight:1
- q3 capability: cpu:2, memory:2, ScalarResources "nvidia/gpu":7, weight:1
Suppose three pods each submit resource requests to the three queues:
- pod1 → q1, requesting cpu:0, memory:0, ScalarResources "nvidia/gpu":0
- pod2 → q2, requesting cpu:1, memory:1, ScalarResources "nvidia/gpu":1
- pod3 → q3, requesting cpu:2, memory:2, ScalarResources "nvidia/gpu":7
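For context, in each round the proportion plugin gives every not-yet-met queue a weight-proportional slice of the remaining cluster resources, i.e. roughly deserved += remaining × weight / Σweights. A quick standalone calculation of the round-1 shares (plain Go, illustrative only, not the plugin's actual types):

```go
package main

import "fmt"

func main() {
	// Cluster total from the case above; all three queues have weight 1.
	total := map[string]float64{"cpu": 2, "memory": 2, "nvidia/gpu": 8}
	weight, totalWeight := 1.0, 3.0

	// Round 1: each queue's deserved share is total * weight / totalWeight.
	for name, quant := range total {
		fmt.Printf("%s: %.2f\n", name, quant*weight/totalWeight)
	}
	// Prints roughly cpu: 0.67, memory: 0.67, nvidia/gpu: 2.67
	// (the issue text truncates these to 0.66 and 2.66).
}
```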
During the first allocation round, q1's deserved is cpu:0.66, memory:0.66, ScalarResources "nvidia/gpu":2.66. The function fragment where the problem occurs is:
```go
// problematic code: once any dimension of deserved exceeds capability,
// the queue is marked "meet" and dropped from all later allocation rounds
if attr.capability != nil && !attr.deserved.LessEqualStrict(attr.capability) {
	attr.deserved = helpers.Min(attr.deserved, attr.capability)
	attr.deserved = helpers.Min(attr.deserved, attr.request)
	meet[attr.queueID] = struct{}{}
	klog.V(4).Infof("queue <%s> is meet cause of the capability", attr.name)
}
```
Inside the `LessEqualStrict` function:

```go
// problematic code
func (r *Resource) LessEqualStrict(rr *Resource) bool {
	lessFunc := func(l, r float64) bool {
		return l <= r
	}

	if !lessFunc(r.MilliCPU, rr.MilliCPU) {
		return false
	}
	if !lessFunc(r.Memory, rr.Memory) {
		return false
	}
	// returns false as soon as a single scalar resource exceeds rr
	for rName, rQuant := range r.ScalarResources {
		if !lessFunc(rQuant, rr.ScalarResources[rName]) {
			return false
		}
	}
	return true
}
```

Because only the ScalarResources dimension exceeded the capability ("nvidia/gpu": 2.66 > 1), q2 exited the allocation loop and no longer took part in subsequent rounds, so the resources q2 actually obtained were cpu:0.66, memory:0.66, gpu:2.66. When the allocate action then runs, pod2 cannot start because the resources are insufficient.
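To make the failure mode concrete, here is a small self-contained reproduction of the comparison (a standalone re-implementation for illustration, using plain maps instead of Volcano's `Resource` type):

```go
package main

import "fmt"

// lessEqualStrict mirrors the comparison's semantics: it fails as soon as
// any single dimension of l exceeds the corresponding dimension of r.
func lessEqualStrict(l, r map[string]float64) bool {
	for name, quant := range l {
		if quant > r[name] {
			return false
		}
	}
	return true
}

func main() {
	deserved := map[string]float64{"cpu": 0.66, "memory": 0.66, "nvidia/gpu": 2.66}
	capability := map[string]float64{"cpu": 1, "memory": 1, "nvidia/gpu": 1} // q2

	// Only nvidia/gpu exceeds q2's capability, yet the whole check fails,
	// so the plugin marks q2 as "meet" and stops growing its deserved share.
	fmt.Println(lessEqualStrict(deserved, capability)) // false
}
```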
In my opinion, the comparison used here to decide that deserved exceeds the queue's capability is incomplete: deserved > capability should only be considered true when every resource dimension exceeds its capability.
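For illustration only, the all-dimensions semantics being proposed might look like the hypothetical helper below (not actual Volcano code; the name `exceedsAllDimensions` is made up):

```go
// Hypothetical sketch: treat deserved as exceeding capability only when
// EVERY resource dimension is over its cap, per the suggestion above.
func exceedsAllDimensions(deserved, capability map[string]float64) bool {
	for name, quant := range deserved {
		if quant <= capability[name] {
			return false // at least one dimension still has headroom
		}
	}
	return true
}
```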
have you tried v1.9.0?
Not yet; upgrading Volcano may have an impact on our project.
> have you tried v1.9.0?

I wonder why this code was designed this way.
Can anyone explain?
This code originally implements DRF logic; you can refer to the paper "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types".
The calculation method is similar to the one described here: https://koordinator.sh/zh-Hans/docs/designs/multi-hierarchy-elastic-quota-management.
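For readers unfamiliar with DRF: each queue's dominant share is the largest fraction of any cluster resource it has been allocated, and DRF repeatedly allocates to whoever has the lowest dominant share. A toy sketch of that computation (illustrative only, not Volcano's implementation):

```go
package main

import "fmt"

// dominantShare returns the DRF dominant share: the largest fraction of
// any cluster resource that this allocation occupies.
func dominantShare(alloc, total map[string]float64) float64 {
	share := 0.0
	for name, quant := range alloc {
		if total[name] > 0 && quant/total[name] > share {
			share = quant / total[name]
		}
	}
	return share
}

func main() {
	total := map[string]float64{"cpu": 2, "memory": 2, "nvidia/gpu": 8}
	q2 := map[string]float64{"cpu": 0.66, "memory": 0.66, "nvidia/gpu": 1}

	// q2's dominant resource is cpu/memory at 0.33, not gpu (1/8 = 0.125).
	fmt.Printf("q2 dominant share: %.2f\n", dominantShare(q2, total))
}
```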