release-1.5: after modifying the guaranteed GPU count, CPU-only tasks cannot be enqueued
What happened: For example, set up two queues, A and B, in a cluster with the following total resources:
cpu: 32
memory: 64Gi
nvidia.com/gpu: 8
Queue A is configured as:
capability:
  cpu: "16"
  memory: 32Gi
  nvidia.com/gpu: "8"
guarantee:
  resource:
    cpu: 8
    memory: 16Gi
    nvidia.com/gpu: "2"
reclaimable: false
weight: 1
Queue B is configured as:
capability:
  cpu: "16"
  memory: 32Gi
  nvidia.com/gpu: "8"
guarantee:
  resource:
    cpu: 8
    memory: 16Gi
    nvidia.com/gpu: "2"
reclaimable: false
weight: 1
Six tasks are submitted to queue A, each requesting:
cpu: 1
memory: 1Gi
nvidia.com/gpu: "1"
The tasks are running. Now change the GPU guarantee of queue B to 4. If a CPU-only task T1 is then submitted to queue A, it cannot be enqueued. T1 requests the following resources:
cpu: 1
memory: 1Gi
What you expected to happen: Task T1 should be enqueued and scheduled successfully.
Environment:
- Volcano Version: release-1.5
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:56:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:47:43Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Kernel (e.g. uname -a):
Linux cce-7tpawg0f-h8v6ksrs 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
/assign @qiankunli
The enqueue check is inqueue := minReq.Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic).LessEqual(attr.realCapability, api.Infinity). If we write (1,1,1) for (cpu=1, mem=1, gpu=1), then
(1,1,0) + (6,6,6) + (0,0,0) - (0,0,0) = (7,7,6), and (7,7,6) <= (8,16,2) is false because the GPU dimension fails (6 > 2). In fact, the queue's remaining resources, (8,16,2) - (7,7,6) = (1,9,-4), can still run a task that requests (1,1,0), since the task needs no GPU. I thought about this at the time, but to keep the calculation simple I didn't change it.
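The arithmetic above can be reproduced with a small stand-alone sketch. The res type and its methods are hypothetical stand-ins for illustration only, not Volcano's api.Resource; the numbers are the ones from this issue.

```go
package main

import "fmt"

// res is a hypothetical (cpu, memory, gpu) vector used only for this illustration.
type res struct{ cpu, mem, gpu float64 }

// add returns the dimension-wise sum of a and b.
func (a res) add(b res) res { return res{a.cpu + b.cpu, a.mem + b.mem, a.gpu + b.gpu} }

// lessEqualAll reports whether every dimension of a is <= the same dimension of b.
func (a res) lessEqualAll(b res) bool {
	return a.cpu <= b.cpu && a.mem <= b.mem && a.gpu <= b.gpu
}

func main() {
	minReq := res{1, 1, 0}          // CPU-only task T1
	allocated := res{6, 6, 6}       // six running tasks of (1,1,1)
	inqueue := res{}                // nothing else waiting
	realCapability := res{8, 16, 2} // value used in the comment above

	needed := minReq.add(allocated).add(inqueue)
	// The check compares all dimensions, so the GPU dimension (6 > 2)
	// rejects T1 even though T1 requests no GPU.
	fmt.Println(needed, needed.lessEqualAll(realCapability))
}
```

The CPU and memory dimensions pass on their own (7 <= 8 and 7 <= 16); only the GPU dimension, which T1 does not use, makes the check fail.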
Maybe we can use something like:
availableAndElasticResources := attr.realCapability.Sub(attr.allocated).Sub(attr.inqueue).Add(attr.elastic)
inqueue := availableAndElasticResources.canRun(minReq)
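A panic may occur when using Sub here: attr.allocated exceeds attr.realCapability in the GPU dimension (6 > 2), so the subtraction would go negative.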
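One possible way around both problems, sketched here under assumptions, is to avoid subtraction entirely by comparing minReq + allocated + inqueue against realCapability + elastic, and to skip dimensions the task does not request. The res type and the canRun signature below are hypothetical and do not reflect Volcano's actual API.

```go
package main

import "fmt"

// res is a hypothetical (cpu, memory, gpu) vector, not Volcano's api.Resource.
type res struct{ cpu, mem, gpu float64 }

// dims returns the three dimensions as an array so they can be checked in a loop.
func (r res) dims() [3]float64 { return [3]float64{r.cpu, r.mem, r.gpu} }

// canRun reports whether req fits into the queue, looking only at the
// dimensions that req actually requests. No subtraction is performed, so
// an overcommitted dimension cannot produce a negative intermediate value.
func canRun(req, allocated, inqueue, elastic, capability res) bool {
	r, a, q, e, c := req.dims(), allocated.dims(), inqueue.dims(), elastic.dims(), capability.dims()
	for i := range r {
		if r[i] == 0 {
			continue // dimension not requested by this task
		}
		if r[i]+a[i]+q[i] > c[i]+e[i] {
			return false
		}
	}
	return true
}

func main() {
	allocated := res{6, 6, 6}
	capability := res{8, 16, 2}
	fmt.Println(canRun(res{1, 1, 0}, allocated, res{}, res{}, capability)) // CPU-only task T1
	fmt.Println(canRun(res{1, 1, 1}, allocated, res{}, res{}, capability)) // task that also needs a GPU
}
```

With the numbers from this issue, the CPU-only task is admitted while a task that also requests a GPU is still rejected. Whether skipping unrequested dimensions is the right policy for the scheduler is a design question this sketch does not settle.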
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗