When a hierarchical queue is enabled, scheduling of tasks on regular queues fails

Open D-xiaobin opened this issue 8 months ago • 16 comments

Description

After enabling hierarchical queues, a job submitted to the normal queue fails to schedule and stays pending. However, once I start a job in a leaf queue of the hierarchical queue, the job in the normal queue is scheduled successfully.

Steps to reproduce the issue

1. Job YAML under my normal queue:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: high-job
    spec:
      queue: queue
      schedulerName: volcano
      minAvailable: 1
      tasks:
        - replicas: 1
          name: high-task
          template:
            spec:
              containers:
                - name: high-container
                  image: 10.231.2.14/library/gpu_burn:latest
                  command: ["/bin/bash", "-c", "nvidia-smi && sleep 600"]
                  resources:
                    limits:
                      nvidia.com/gpu: 2
              restartPolicy: Never

My queue:

Image

2. Job in my hierarchical queue:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: job-a
    spec:
      queue: subchild-queue-a1
      schedulerName: volcano
      minAvailable: 1
      tasks:
        - replicas: 1
          name: test
          template:
            spec:
              containers:
                - image: 10.231.2.14/library/gpu_burn:latest
                  name: burn
                  command: ["/bin/bash", "-c", "nvidia-smi && sleep 600"]
                  imagePullPolicy: IfNotPresent
                  resources:
                    requests:
                      cpu: "2"
                      memory: 4Gi
                      nvidia.com/gpu: 2
                    limits:
                      cpu: "2"
                      memory: "4Gi"
                      nvidia.com/gpu: 2

3. My child-queue:

    # The parent queue of child-queue-a is the root queue
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: child-queue-a
    spec:
      reclaimable: true
      parent: root
      deserved:
        cpu: 64
        memory: 128Gi
        nvidia.com/gpu: "8"

    # The parent queue of child-queue-b is the root queue
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: child-queue-b
    spec:
      reclaimable: true
      parent: root
      deserved:
        cpu: 64
        memory: 128Gi
        nvidia.com/gpu: "16"

4. My subchild-queue:

    # The parent queue of subchild-queue-a1 is child-queue-a
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: subchild-queue-a1
    spec:
      reclaimable: true
      parent: child-queue-a
      # Set deserved as needed; if the queue's allocated resources exceed the deserved value, tasks in the queue can be preempted
      deserved:
        cpu: 32
        memory: 64Gi
        nvidia.com/gpu: "3"

    # The parent queue of subchild-queue-a2 is child-queue-a
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: subchild-queue-a2
    spec:
      reclaimable: true
      parent: child-queue-a
      # Set deserved as needed; if the queue's allocated resources exceed the deserved value, tasks in the queue can be preempted
      deserved:
        cpu: 32
        memory: 64Gi
        nvidia.com/gpu: "4"

5. At this point, when I submit high-job on its own, it stays pending:

Image

When I also start the vcjob in the hierarchical queue, both jobs run:

Image

Describe the results you received and expected

I would like to know why this happens and how to solve it.

What version of Volcano are you using?

v1.11.1

Any other relevant information

No response

D-xiaobin avatar Apr 24 '25 09:04 D-xiaobin

/cc

hwdef avatar Apr 28 '25 09:04 hwdef

/cc

who

D-xiaobin avatar Apr 28 '25 09:04 D-xiaobin

It means I want to follow the progress of this issue.

hwdef avatar Apr 28 '25 13:04 hwdef

Let me confirm, you mean there is a normal queue, its name is "queue", and then you submit a high-job to this queue, right? But in fact, you have enabled the hierarchical queue functionality, and this "queue" is also a subqueue of the root queue. The queue plugin you use is capacity, right? We can set the scheduler log level to 5 and see why the high-job pg is Inqueue instead of Running.

JesseStutler avatar May 06 '25 03:05 JesseStutler

Image

This is the vc-scheduler log. It reported "Failed to check queue's hierarchical structure, error: deserved resources of queue are less than the sum of its child queues' deserved resources". I then doubled the deserved value of child-queue-a, but it still didn't work.
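The rule named in that error message can be sketched as follows. This is a simplified Python model, not Volcano's Go implementation: it only looks at nvidia.com/gpu, the `find_violations` helper and the `QUEUES` layout are hypothetical names, and the values are copied from the queue manifests posted above. The root queue is skipped because its deserved value does not appear in those manifests.

```python
from collections import defaultdict

# Deserved GPU counts and parents copied from the queue manifests in this issue.
# Only nvidia.com/gpu is modelled; this is an illustrative assumption.
QUEUES = {
    "child-queue-a": {"parent": "root", "gpu": 8},
    "child-queue-b": {"parent": "root", "gpu": 16},
    "subchild-queue-a1": {"parent": "child-queue-a", "gpu": 3},
    "subchild-queue-a2": {"parent": "child-queue-a", "gpu": 4},
}


def find_violations(queues):
    """Return (parent, parent_deserved, children_sum) for every parent whose
    deserved GPUs are below the sum of its direct children's deserved GPUs."""
    children_sum = defaultdict(int)
    for spec in queues.values():
        children_sum[spec["parent"]] += spec["gpu"]

    violations = []
    for name, total in children_sum.items():
        if name not in queues:
            # For example root: its deserved value is not in the manifests shown.
            continue
        if queues[name]["gpu"] < total:
            violations.append((name, queues[name]["gpu"], total))
    return violations


if __name__ == "__main__":
    print(find_violations(QUEUES))
```

Under this model the posted manifests do not break the rule: child-queue-a deserves 8 GPUs and its two children deserve 3 + 4 = 7.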

Image

D-xiaobin avatar May 06 '25 05:05 D-xiaobin

When I remove the subchild-queue, it works again

D-xiaobin avatar May 06 '25 06:05 D-xiaobin

@JesseStutler

D-xiaobin avatar May 06 '25 06:05 D-xiaobin

After you double the deserved value of child-queue-a, do the error logs for the hierarchical queue check still appear? @D-xiaobin

JesseStutler avatar May 06 '25 06:05 JesseStutler

Yes, it still reports "Capacity plugin failed to check queue's hierarchical structure!". Now jobs created under my hierarchical queue are pending as well, and I don't know what happened. @JesseStutler

D-xiaobin avatar May 06 '25 06:05 D-xiaobin

Is the log still "deserved resources of queue <%s> are less than the sum of its child queues' deserved resources"?

JesseStutler avatar May 07 '25 02:05 JesseStutler

No, it now reports "Capacity plugin failed to check queue's hierarchical structure!". I found that the queue configuration works fine if the GPU resources are commented out, even though the queue configuration I changed afterwards does not exceed the cluster's total GPU resources.

This is the GPU usage of my queues:

Image

The parent queue of subchild-queue-a1 and subchild-queue-a2 is child-queue-a, the parent queue of subchild-queue-b1 is child-queue-b, and queue is the default queue, which belongs to the root queue.

D-xiaobin avatar May 07 '25 07:05 D-xiaobin

Please take a look. Thanks, bro @JesseStutler

D-xiaobin avatar May 07 '25 07:05 D-xiaobin

/assign

JesseStutler avatar May 08 '25 11:05 JesseStutler

#3946 fixes this bug. The cause is that attr.request's ScalarResources is an empty map and MinDimension's defaultValue is incorrectly set to api.Zero, so attr.Deserved is also set to 0: https://github.com/volcano-sh/volcano/blob/9da8c27fa355ab88284696dd1dfcaee7d232e34b/pkg/scheduler/plugins/capacity/capacity.go#L682

We don't need this setting in the capacity plugin, so simply deleting this line is fine.
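The effect described above can be illustrated with a toy model. This is a hedged Python sketch, not Volcano's Go code: `min_dimension` and its `missing_is_zero` flag are hypothetical stand-ins for a per-dimension minimum whose default for missing resources is either zero or unbounded. It only shows why an empty request map combined with a zero default wipes out the deserved GPU value.

```python
def min_dimension(deserved, request, missing_is_zero):
    """Clamp each resource in `deserved` to the matching value in `request`.

    Toy model only. `missing_is_zero` is a hypothetical flag: when a resource
    is absent from `request`, treat it as 0 (True) or as unbounded (False).
    """
    clamped = {}
    for name, value in deserved.items():
        if name in request:
            clamped[name] = min(value, request[name])
        elif missing_is_zero:
            clamped[name] = 0
        else:
            clamped[name] = value
    return clamped


deserved = {"nvidia.com/gpu": 8}
request = {}  # empty map: no scalar resources recorded for the queue

print(min_dimension(deserved, request, missing_is_zero=True))   # {'nvidia.com/gpu': 0}
print(min_dimension(deserved, request, missing_is_zero=False))  # {'nvidia.com/gpu': 8}
```

With the zero default, the deserved GPU count collapses to 0 as soon as the request map is empty. Note that the fix described in this comment removes the call entirely rather than changing the default.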

JesseStutler avatar May 08 '25 12:05 JesseStutler

Thank you, I'll keep an eye on it.

D-xiaobin avatar May 09 '25 01:05 D-xiaobin

#3946 has been merged. You can try again after a new version is released. Thanks @D-xiaobin

JesseStutler avatar May 09 '25 04:05 JesseStutler

Why was this removed in a later version? In v1.12.2 I hit the same problem: when using the capacity plugin, no tasks can be scheduled.

rockburning avatar Aug 16 '25 17:08 rockburning

> Why was this removed in a later version? In v1.12.2 I hit the same problem: when using the capacity plugin, no tasks can be scheduled.

What phenomenon? Did the capacity hierarchy check also fail?

JesseStutler avatar Aug 17 '25 14:08 JesseStutler

> > Why was this removed in a later version? In v1.12.2 I hit the same problem: when using the capacity plugin, no tasks can be scheduled.
>
> What phenomenon? Did the capacity hierarchy check also fail?

Yes. I will check the failure again. I am now using the master branch, and it seems to work. https://github.com/volcano-sh/volcano/commits/master/pkg/scheduler/plugins/capacity/capacity.go

Image

The master branch has merged this MR, and it works for me.

rockburning avatar Aug 18 '25 15:08 rockburning