
release-1.5: After changing the guaranteed GPU count, CPU-only tasks cannot be enqueued

Open hansongChina opened this issue 3 years ago • 4 comments

What happened: For example, set up two queues, A and B, in a cluster whose total resources are:

cpu: 32
memory: 64Gi
nvidia.com/gpu: 8

Queue A is configured as:

  capability:
    cpu: "16"
    memory: 32Gi
    nvidia.com/gpu: "8"
  guarantee:
    resource:
      cpu: 8
      memory: 16Gi
      nvidia.com/gpu: "2"
  reclaimable: false
  weight: 1

Queue B is configured as:

  capability:
    cpu: "16"
    memory: 32Gi
    nvidia.com/gpu: "8"
  guarantee:
    resource:
      cpu: 8
      memory: 16Gi
      nvidia.com/gpu: "2"
  reclaimable: false
  weight: 1

Queue A starts 6 tasks, each requesting:

      cpu: 1
      memory: 1Gi
      nvidia.com/gpu: "1"

While these tasks are running, change the GPU guarantee of queue B to 4. If a CPU-only task T1 is then submitted to queue A, it cannot be enqueued. T1 requests:

   cpu: 1
   memory: 1Gi

What you expected to happen: Task T1 should be enqueued and scheduled successfully.

Environment:

  • Volcano Version: release-1.5
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:56:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:47:43Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a): Linux cce-7tpawg0f-h8v6ksrs 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

hansongChina avatar May 18 '22 09:05 hansongChina

/assign @qiankunli

Thor-wl avatar May 19 '22 01:05 Thor-wl

The check is inqueue := minReq.Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic).LessEqual(attr.realCapability, api.Infinity). If we use (1,1,1) to represent (cpu=1,mem=1,gpu=1), then:

(1,1,0) + (6,6,6) + (0,0,0) - (0,0,0) = (7,7,6) <= (8,16,2) ==> false. In fact, the difference (8,16,2) - (7,7,6) = (1,9,-4) is negative only in the GPU dimension, which the task does not request, so the queue can still run a task with resources (1,1,0). I thought about this at the time, but to keep the calculation simple I didn't change it.

Maybe we can change it to:

availableAndElasticResources := attr.realCapability.Sub(attr.allocated).Sub(attr.inqueue).Add(attr.elastic)
inqueue := availableAndElasticResources.canRun(minReq)
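The comparison above can be reproduced with a small self-contained sketch. Vec, currentCheck, and canRun are hypothetical names used only for illustration, not Volcano's api.Resource or its methods, and the numbers are the ones quoted in this comment. The assumption is that canRun only looks at the dimensions the task actually requests.

```go
package main

import "fmt"

// Vec is a hypothetical (cpu, mem, gpu) triple standing in for Volcano's api.Resource.
type Vec [3]float64

// Add returns the element-wise sum.
func (a Vec) Add(b Vec) Vec {
	return Vec{a[0] + b[0], a[1] + b[1], a[2] + b[2]}
}

// Sub returns the element-wise difference; in this sketch it is allowed to go negative.
func (a Vec) Sub(b Vec) Vec {
	return Vec{a[0] - b[0], a[1] - b[1], a[2] - b[2]}
}

// LessEqual reports whether every dimension of a is <= the same dimension of b.
func (a Vec) LessEqual(b Vec) bool {
	for i := range a {
		if a[i] > b[i] {
			return false
		}
	}
	return true
}

// currentCheck mirrors the existing formula:
// minReq + allocated + inqueue - elastic <= realCapability, in every dimension.
func currentCheck(minReq, allocated, inqueue, elastic, realCap Vec) bool {
	return minReq.Add(allocated).Add(inqueue).Sub(elastic).LessEqual(realCap)
}

// canRun (hypothetical) only requires the requested dimensions to fit into what remains.
func canRun(remaining, minReq Vec) bool {
	for i := range minReq {
		if minReq[i] > 0 && minReq[i] > remaining[i] {
			return false
		}
	}
	return true
}

func main() {
	minReq := Vec{1, 1, 0}
	allocated := Vec{6, 6, 6}
	inqueue := Vec{0, 0, 0}
	elastic := Vec{0, 0, 0}
	realCap := Vec{8, 16, 2}

	// The GPU dimension (6 > 2) fails even though the task requests no GPU.
	fmt.Println(currentCheck(minReq, allocated, inqueue, elastic, realCap)) // false

	remaining := realCap.Sub(allocated).Sub(inqueue).Add(elastic)
	fmt.Println(remaining)                 // [2 10 -4]
	fmt.Println(canRun(remaining, minReq)) // true
}
```

The GPU deficit is ignored for T1 because T1 requests no GPU, while a task requesting a GPU would still be rejected.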

qiankunli avatar May 20 '22 09:05 qiankunli

availableAndElasticResources := attr.realCapability.Sub(attr.allocated).Sub(attr.inqueue).Add(attr.elastic)
inqueue := availableAndElasticResources.canRun(minReq)

Using Sub here may cause a panic.
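One way to sidestep that concern is to avoid subtraction entirely and compare per dimension. The following is a minimal sketch under stated assumptions: Res and fitsRequested are hypothetical names, not part of Volcano, and a dimension missing from the capability map is treated as having zero capacity.

```go
package main

import "fmt"

// Res is a hypothetical resource map keyed by resource name (not Volcano's api.Resource).
type Res map[string]float64

// fitsRequested reports whether req fits next to used within capability,
// checking only the dimensions that req actually asks for.
// It never subtracts, so no intermediate value can go negative.
func fitsRequested(req, used, capability Res) bool {
	for name, want := range req {
		if want <= 0 {
			continue // dimension not requested, ignore it
		}
		if used[name]+want > capability[name] {
			return false
		}
	}
	return true
}

func main() {
	capability := Res{"cpu": 8, "memory": 16, "nvidia.com/gpu": 2}
	used := Res{"cpu": 6, "memory": 6, "nvidia.com/gpu": 6}

	cpuTask := Res{"cpu": 1, "memory": 1}
	gpuTask := Res{"cpu": 1, "memory": 1, "nvidia.com/gpu": 1}

	fmt.Println(fitsRequested(cpuTask, used, capability)) // true
	fmt.Println(fitsRequested(gpuTask, used, capability)) // false
}
```

With the numbers from this issue, the CPU-only task is admitted while a task that requests a GPU is still rejected.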

hansongChina avatar Jun 07 '22 07:06 hansongChina

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Sep 08 '22 22:09 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Nov 12 '22 09:11 stale[bot]