When a hierarchical queue is enabled, scheduling of tasks on regular queues fails
Description
After enabling hierarchical queues, a job submitted to the normal queue fails to schedule and stays Pending. However, once I start a job in a leaf queue of the hierarchy, the job in the normal queue is scheduled successfully.
Steps to reproduce the issue
1. Job YAML submitted to my normal queue:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: high-job
spec:
  queue: queue
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 1
      name: high-task
      template:
        spec:
          containers:
            - name: high-container
              image: 10.231.2.14/library/gpu_burn:latest
              command: ["/bin/bash", "-c", "nvidia-smi && sleep 600"]
              resources:
                limits:
                  nvidia.com/gpu: 2
          restartPolicy: Never
My queue
2. Job submitted to my hierarchical queue:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-a
spec:
  queue: subchild-queue-a1
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 1
      name: test
      template:
        spec:
          containers:
            - image: 10.231.2.14/library/gpu_burn:latest
              name: burn
              command: ["/bin/bash", "-c", "nvidia-smi && sleep 600"]
              imagePullPolicy: IfNotPresent
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
                  nvidia.com/gpu: 2
                limits:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: 2
3. My child queues:
# The parent queue of child-queue-a is the root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue-a
spec:
  reclaimable: true
  parent: root
  deserved:
    cpu: 64
    memory: 128Gi
    nvidia.com/gpu: "8"
---
# The parent queue of child-queue-b is the root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue-b
spec:
  reclaimable: true
  parent: root
  deserved:
    cpu: 64
    memory: 128Gi
    nvidia.com/gpu: "16"
4. My subchild queues:
# The parent queue of subchild-queue-a1 is child-queue-a
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: subchild-queue-a1
spec:
  reclaimable: true
  parent: child-queue-a
  # Set deserved as needed; if the queue's allocated resources exceed the deserved value, tasks in the queue can be preempted
  deserved:
    cpu: 32
    memory: 64Gi
    nvidia.com/gpu: "3"
---
# The parent queue of subchild-queue-a2 is child-queue-a
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: subchild-queue-a2
spec:
  reclaimable: true
  parent: child-queue-a
  # Set deserved as needed; if the queue's allocated resources exceed the deserved value, tasks in the queue can be preempted
  deserved:
    cpu: 32
    memory: 64Gi
    nvidia.com/gpu: "4"
5. At this point, when I submit high-job alone, it stays Pending.
When I also start the vcjob in the hierarchical queue, both jobs run.
Describe the results you received and expected
I would like to know why this happens and how to solve it.
What version of Volcano are you using?
v1.11.1
Any other relevant information
No response
/cc
/cc
Who?
It means I want to follow the subsequent progress of this issue.
Let me confirm, you mean there is a normal queue, its name is "queue", and then you submit a high-job to this queue, right? But in fact, you have enabled the hierarchical queue functionality, and this "queue" is also a subqueue of the root queue. The queue plugin you use is capacity, right? We can set the scheduler log level to 5 and see why the high-job pg is Inqueue instead of Running.
This is the vc-scheduler log. It reports: Failed to check queue's hierarchical structure, error: deserved resources of queue
When I remove the subchild-queue, it works again
@JesseStutler
After you double the deserved value of child-queue-a, do the error logs for the hierarchical queue check still appear? @D-xiaobin
Yes, it still reports "Capacity plugin failed to check queue's hierarchical structure!" Now jobs created under my hierarchical queues are also Pending, and I don't know what happened. @JesseStutler
Is the log still "deserved resources of queue <%s> are less than the sum of its child queues' deserved resources"?
No, now it reports "Capacity plugin failed to check queue's hierarchical structure!" I also see that the queue configuration works fine if I comment out the GPU resources, even though the queue configuration I changed afterwards does not exceed the cluster's total GPU resources.
This is the GPU usage of my queues
The parent queue of subchild-queue-a1 and subchild-queue-a2 is child-queue-a, and the parent queue of subchild-queue-b1 is child-queue-b. "queue" is the default queue and belongs to the root queue.
Please take a look. Thanks, bro @JesseStutler
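For readers following along, the rule named in the error message (a queue's deserved resources must not be less than the sum of its child queues' deserved resources) can be sketched roughly as below. This is a simplified illustration, not the capacity plugin's code: the `Resources` and `Queue` types and the `checkHierarchy` and `exampleQueues` functions are hypothetical names, and only the numbers for the child-queue-a subtree are taken from the YAML in this thread. With those numbers the rule holds (3 + 4 GPUs against 8), so the second call uses an assumed zeroed GPU value purely to show what a failing case looks like.

```go
package main

import "fmt"

// Resources maps a resource name to an amount. Units are illustrative
// (CPU cores, GiB of memory, GPU count).
type Resources map[string]float64

// Queue is a hypothetical, simplified stand-in for a Volcano Queue object.
type Queue struct {
	Name     string
	Parent   string
	Deserved Resources
}

// checkHierarchy returns an error when a queue's deserved amount for some
// resource is smaller than the sum of its direct children's deserved amounts.
// Parents that are not present in the slice (for example root) are skipped.
func checkHierarchy(queues []Queue) error {
	childSum := map[string]Resources{}
	for _, q := range queues {
		if childSum[q.Parent] == nil {
			childSum[q.Parent] = Resources{}
		}
		for name, amount := range q.Deserved {
			childSum[q.Parent][name] += amount
		}
	}
	for _, q := range queues {
		for name, total := range childSum[q.Name] {
			if q.Deserved[name] < total {
				return fmt.Errorf("queue %s: deserved %s=%v is less than children's sum %v",
					q.Name, name, q.Deserved[name], total)
			}
		}
	}
	return nil
}

// exampleQueues returns the child-queue-a subtree with the values quoted in this thread.
func exampleQueues() []Queue {
	return []Queue{
		{Name: "child-queue-a", Parent: "root",
			Deserved: Resources{"cpu": 64, "memory": 128, "nvidia.com/gpu": 8}},
		{Name: "subchild-queue-a1", Parent: "child-queue-a",
			Deserved: Resources{"cpu": 32, "memory": 64, "nvidia.com/gpu": 3}},
		{Name: "subchild-queue-a2", Parent: "child-queue-a",
			Deserved: Resources{"cpu": 32, "memory": 64, "nvidia.com/gpu": 4}},
	}
}

func main() {
	queues := exampleQueues()
	fmt.Println("as configured:", checkHierarchy(queues))

	// Assumption for illustration only: the parent's GPU deserved value is zeroed.
	queues[0].Deserved["nvidia.com/gpu"] = 0
	fmt.Println("parent GPU zeroed:", checkHierarchy(queues))
}
```

Under these assumptions, the configuration as written passes the rule, which suggests the failure is not caused by the deserved values in the YAML themselves.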
/assign
#3946 can fix this bug. The cause is that attr.request's ScalarResources is an empty map and MinDimension's defaultValue is incorrectly set to api.Zero, so attr.Deserved is also set to 0: https://github.com/volcano-sh/volcano/blob/9da8c27fa355ab88284696dd1dfcaee7d232e34b/pkg/scheduler/plugins/capacity/capacity.go#L682
We don't need this setting in the capacity plugin, so simply deleting this line is fine.
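The effect described above can be illustrated with a small model. This is a sketch under assumptions, not the Volcano source: `minDimension`, `defaultValue`, `zero`, and `infinity` are hypothetical stand-ins for the behavior the comment attributes to MinDimension with api.Zero, and plain maps stand in for the scalar resources of deserved and request.

```go
package main

import (
	"fmt"
	"math"
)

// defaultValue says how a dimension missing from the request is treated.
// These names are hypothetical stand-ins for the behavior described in the thread.
type defaultValue int

const (
	zero     defaultValue = iota // a missing dimension counts as 0
	infinity                     // a missing dimension counts as unbounded
)

// minDimension returns, per resource in deserved, the smaller of the deserved
// and requested amounts. A resource absent from request is resolved via def.
func minDimension(deserved, request map[string]float64, def defaultValue) map[string]float64 {
	out := make(map[string]float64, len(deserved))
	for name, d := range deserved {
		r, ok := request[name]
		switch {
		case ok:
			out[name] = math.Min(d, r)
		case def == infinity:
			out[name] = d // nothing requested: keep the configured value
		default:
			out[name] = 0 // nothing requested: the configured value is wiped
		}
	}
	return out
}

func main() {
	deserved := map[string]float64{"nvidia.com/gpu": 8}
	emptyRequest := map[string]float64{} // no job has requested GPUs in this queue yet

	fmt.Println(minDimension(deserved, emptyRequest, zero))     // map[nvidia.com/gpu:0]
	fmt.Println(minDimension(deserved, emptyRequest, infinity)) // map[nvidia.com/gpu:8]
}
```

In this model, an empty request combined with a zero default wipes the configured GPU value, while a non-empty request does not. The fix described in the thread removes the call from the capacity plugin altogether rather than switching the default.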
Thank you, I'll keep an eye on it.
#3946 has been merged. You can try again after the new version is released, thanks @D-xiaobin
Why was it removed in a later version? On v1.12.2 I hit the same problem: when using the capacity plugin, no tasks can be scheduled.
What symptom are you seeing? Did the capacity hierarchy check also fail?
Yes, the check failed again. I am now using the master branch and it seems to work. https://github.com/volcano-sh/volcano/commits/master/pkg/scheduler/plugins/capacity/capacity.go