volcano
Fair sharing not working
What happened: My cluster has 11 CPUs in total. I'm trying to create 2 queues (excluding the default queue), each with weight 5. Queue manifest,
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  weight: 5
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 5
Queue List,
Name Weight State Inqueue Pending Running Unknown
default 1 Open 0 0 0 0
test 5 Open 0 0 0 0
test1 5 Open 0 0 0 0
Created 3 Jobs in the test queue with the following CPU requests: job1 -> 5 CPU, job2 -> 5 CPU, job3 -> 1 CPU.
Now all 3 jobs are running and utilizing the full cluster.
Now I'm creating a new Job in the test1 queue with 2 CPU. I'm expecting 1 Job to be evicted from the test queue so that the Job in the test1 queue can run. But the Job in the test1 queue stays in the Inqueue state.
Name Weight State Inqueue Pending Running Unknown
default 1 Open 0 0 0 0
test 5 Open 0 0 3 0
test1 5 Open 1 0 0 0
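The expectation behind this report can be worked out by hand. The sketch below is only an illustration of a plain weight-proportional split, not Volcano's code. It assumes that only the queues holding jobs (test and test1) take part in the weighting, and the helper name nominal_share is made up for this example.

```python
def nominal_share(total_cpu, weights):
    """Split total_cpu across queues in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: total_cpu * weight / total_weight
            for name, weight in weights.items()}


# Assumption: only queues that currently hold jobs are weighted.
weights = {"test": 5, "test1": 5}
shares = nominal_share(11, weights)

# CPU in use per queue after the three jobs in test are running.
in_use = {"test": 11, "test1": 0}
for name, fair in shares.items():
    status = "over its share" if in_use[name] > fair else "within its share"
    print(f"{name}: fair share {fair} CPU, in use {in_use[name]} CPU, {status}")
```

Under this simple model each queue is entitled to 5.5 CPU, so test is over its share and the 2 CPU job in test1 would fit inside its own share.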
Configuration,
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
What you expected to happen: I expected 1 Job to be evicted from the test queue so that the Job in the test1 queue could run. Instead, the Job in the test1 queue stays in the Inqueue state.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Volcano Version: v1.3.0
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
Can you help with that? @renhuanyu
@Sharathmk99 please configure the reclaim action and try again.
actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
The actions list must include reclaim if you want to use the fair-sharing feature.
@wpeng102 @lowang-bh I included reclaim in the actions. Still the same issue,
Name Weight State Inqueue Pending Running Unknown
default 1 Open 0 0 0 0
test 5 Open 0 0 3 0
test1 5 Open 1 0 0 0
Config,
actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
Restarted all 3 pods,
kubectl rollout restart deployment -n volcano-system
Do I need to share any other details?
@lowang-bh @wpeng102 Is my Job manifest correct?
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
  namespace: default
spec:
  schedulerName: volcano
  policies:
    - event: TaskCompleted
      action: CompleteJob
  queue: test1
  tasks:
    - replicas: 1
      name: "bash"
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        metadata:
          labels:
            app: bash
        spec:
          containers:
            - name: bash
              image: bash
              command: ["bash", "-c", "echo 'sleep...'; sleep 300"]
              resources:
                requests:
                  cpu: "2"
          restartPolicy: Never
@wpeng102 @lowang-bh
Description of PodGroup,
kubectl describe podgroup -n default test-job
Name: test-job
Namespace: default
Labels: <none>
Annotations: <none>
API Version: scheduling.volcano.sh/v1beta1
Kind: PodGroup
Metadata:
Creation Timestamp: 2021-07-15T08:00:24Z
Generation: 5
Managed Fields:
API Version: scheduling.volcano.sh/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:status:
f:conditions:
f:phase:
Manager: vc-scheduler
Operation: Update
Time: 2021-07-15T08:00:25Z
API Version: scheduling.volcano.sh/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:ownerReferences:
.:
k:{"uid":"dc958eed-ee1d-4d7c-a668-cc44df37bafb"}:
.:
f:apiVersion:
f:blockOwnerDeletion:
f:controller:
f:kind:
f:name:
f:uid:
f:spec:
.:
f:minMember:
f:minResources:
.:
f:cpu:
f:minTaskMember:
.:
f:bash:
f:queue:
f:status:
.:
f:conditions:
f:phase:
Manager: vc-controller-manager
Operation: Update
Time: 2021-07-15T08:00:26Z
Owner References:
API Version: batch.volcano.sh/v1alpha1
Block Owner Deletion: true
Controller: true
Kind: Job
Name: test-job
UID: dc958eed-ee1d-4d7c-a668-cc44df37bafb
Resource Version: 184342
UID: 805f905b-02d3-498e-9846-770caeeed38f
Spec:
Min Member: 1
Min Resources:
Cpu: 2
Min Task Member:
Bash: 1
Queue: test1
Status:
Conditions:
Last Transition Time: 2021-07-15T08:00:26Z
Message: 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable.
Reason: NotEnoughResources
Status: True
Transition ID: 444ca19c-bbed-4855-b75a-38d81995bd52
Type: Unschedulable
Phase: Inqueue
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unschedulable 15s (x2 over 16s) volcano 0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable.
Warning Unschedulable 1s (x14 over 14s) volcano 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable.
Pod describe events,
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m19s volcano all nodes are unavailable: 1 node(s) resource fit failed.
Logs of Scheduler,
kubectl logs -n volcano-system volcano-scheduler-595c747db-bx2g9
I0715 07:53:38.463612 1 session.go:151] Open Session f887351a-bada-4666-97a7-43e1e22ad2a7 with <4> Job and <3> Queues
I0715 07:53:38.463907 1 enqueue.go:44] Enter Enqueue ...
I0715 07:53:38.463925 1 enqueue.go:62] Added Queue <test> for Job <test-ns/test-job>
I0715 07:53:38.463930 1 enqueue.go:62] Added Queue <test1> for Job <default/test-job>
I0715 07:53:38.463935 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0715 07:53:38.463938 1 enqueue.go:102] Leaving Enqueue ...
I0715 07:53:38.463943 1 allocate.go:43] Enter Allocate ...
I0715 07:53:38.463947 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463955 1 job_info.go:555] job test-job2/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463962 1 job_info.go:555] job test-job/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463967 1 job_info.go:555] job test-job1/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463973 1 allocate.go:94] Try to allocate resource to 2 Namespaces
I0715 07:53:38.463980 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0715 07:53:38.463986 1 allocate.go:192] Try to allocate resource to 1 tasks of Job <default/test-job>
I0715 07:53:38.463990 1 allocate.go:200] There are <1> nodes for Job <default/test-job>
I0715 07:53:38.464027 1 scheduler_helper.go:103] Predicates failed for task <default/test-job-bash-0> on node <docker-desktop>: task default/test-job-bash-0 on node docker-desktop fit failed: node(s) resource fit failed
I0715 07:53:38.464057 1 statement.go:353] Discarding operations ...
I0715 07:53:38.464079 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0715 07:53:38.464090 1 allocate.go:154] Namespace <default> have no queue, skip it
I0715 07:53:38.464093 1 proportion.go:247] Queue <test>: deserved <cpu 10000.00, memory 0.00, hugepages-2Mi 0.00>, allocated <cpu 11000.00, memory 0.00>, share <1.1>
I0715 07:53:38.464104 1 allocate.go:143] Namespace <test-ns> Queue <test> is overused, ignore it.
I0715 07:53:38.464111 1 allocate.go:154] Namespace <test-ns> have no queue, skip it
I0715 07:53:38.464120 1 allocate.go:271] Leaving Allocate ...
I0715 07:53:38.464124 1 backfill.go:41] Enter Backfill ...
I0715 07:53:38.464128 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464133 1 job_info.go:555] job test-job2/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464138 1 job_info.go:555] job test-job/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464141 1 job_info.go:555] job test-job1/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464146 1 backfill.go:91] Leaving Backfill ...
I0715 07:53:38.464150 1 reclaim.go:41] Enter Reclaim ...
I0715 07:53:38.464154 1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0715 07:53:38.464157 1 job_info.go:555] job test-job2/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464162 1 job_info.go:555] job test-job/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464166 1 job_info.go:555] job test-job1/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464171 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464184 1 reclaim.go:124] Considering Task <default/test-job-bash-0> on Node <docker-desktop>.
I0715 07:53:38.464203 1 reclaim.go:148] No validated victims on Node <docker-desktop>: no victims
I0715 07:53:38.464207 1 proportion.go:247] Queue <test>: deserved <cpu 10000.00, memory 0.00, hugepages-2Mi 0.00>, allocated <cpu 11000.00, memory 0.00>, share <1.1>
I0715 07:53:38.464212 1 reclaim.go:95] Queue <test> is overused, ignore it.
I0715 07:53:38.464217 1 reclaim.go:189] Leaving Reclaim ...
I0715 07:53:38.464336 1 session.go:170] Close Session f887351a-bada-4666-97a7-43e1e22ad2a7
@wpeng102 @Thor-wl @renhuanyu could you please help me to figure out the issue?
Reclaim works only when several conditions are all met. You can check them in each plugin's AddReclaimableFn:
- gang plugin: preemptable := job.MinAvailable == 0 || job.MinAvailable <= job.ReadyTaskNum()-1
- conformance plugin: the evictor cannot reclaim pods in a system namespace or with a system priority
- drf plugin:
- proportion plugin: the victim's queue must have more allocated than it deserves (that means it is overused)
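The gang condition quoted above can be tried against the jobs in this thread. This is a minimal sketch that only restates that one expression in Python. It assumes each running job has a single ready task and a minAvailable of 1, as the PodGroup shown earlier does; the JobState class and the job named wide-job are hypothetical and exist only for the example.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    name: str
    min_available: int  # minMember of the job's PodGroup
    ready_tasks: int    # tasks of the job that are currently ready


def gang_preemptable(job: JobState) -> bool:
    # Restates the expression quoted in the thread:
    # job.MinAvailable == 0 || job.MinAvailable <= job.ReadyTaskNum()-1
    return job.min_available == 0 or job.min_available <= job.ready_tasks - 1


# Assumed state of the three jobs running in the test queue.
jobs = [
    JobState("test-job", min_available=1, ready_tasks=1),
    JobState("test-job1", min_available=1, ready_tasks=1),
    JobState("test-job2", min_available=1, ready_tasks=1),
    # Hypothetical job that has more ready tasks than its gang minimum.
    JobState("wide-job", min_available=1, ready_tasks=3),
]
for job in jobs:
    print(job.name, "preemptable:", gang_preemptable(job))
```

Under these assumptions none of the single-task jobs is preemptable, because evicting their only task would drop them below minAvailable. Only a job with spare ready tasks, such as the hypothetical wide-job, passes the check.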
@Sharathmk99 could you change the scheduler log level to V=4 and attach the scheduler log again?
You can edit the scheduler deployment and set -v=4 in the container args:
https://github.com/volcano-sh/volcano/blob/44ec8eb28df26fbacd03ab15edabfc5916900c25/installer/volcano-development.yaml#L7592-L7595
@wpeng102 Please find below logs,
I0722 20:42:56.019764 1 session.go:151] Open Session 68f9ce6a-7517-4b0e-a8a5-ab3c1f4ff804 with <4> Job and <3> Queues
I0722 20:42:56.020020 1 proportion.go:75] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0722 20:42:56.020075 1 proportion.go:79] Considering Job <default/test-job1>.
I0722 20:42:56.020088 1 proportion.go:97] Added Queue <test> attributes.
I0722 20:42:56.020099 1 proportion.go:79] Considering Job <default/test-job>.
I0722 20:42:56.020106 1 proportion.go:79] Considering Job <default/test-job2>.
I0722 20:42:56.020113 1 proportion.go:79] Considering Job <default/test-job-new>.
I0722 20:42:56.020121 1 proportion.go:97] Added Queue <test1> attributes.
I0722 20:42:56.020172 1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <10>.
I0722 20:42:56.020224 1 proportion.go:175] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0722 20:42:56.020273 1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0722 20:42:56.020353 1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <10>.
I0722 20:42:56.020386 1 proportion.go:172] queue <test1> is meet
I0722 20:42:56.020426 1 proportion.go:179] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0722 20:42:56.020452 1 proportion.go:191] Remaining resource is <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0722 20:42:56.020474 1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <5>.
I0722 20:42:56.020488 1 proportion.go:172] queue <test> is meet
I0722 20:42:56.020515 1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0722 20:42:56.020532 1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <5>.
I0722 20:42:56.020577 1 proportion.go:191] Remaining resource is <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0722 20:42:56.020598 1 proportion.go:144] Exiting when total weight is 0
I0722 20:42:56.020919 1 binpack.go:158] Enter binpack plugin ...
I0722 20:42:56.020957 1 binpack.go:177] resources [] record in weight but not found on any node
I0722 20:42:56.020969 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0722 20:42:56.020984 1 drf.go:207] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0722 20:42:56.021018 1 enqueue.go:44] Enter Enqueue ...
I0722 20:42:56.021055 1 enqueue.go:62] Added Queue <test> for Job <default/test-job>
I0722 20:42:56.021067 1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0722 20:42:56.021124 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0722 20:42:56.021155 1 enqueue.go:102] Leaving Enqueue ...
I0722 20:42:56.021167 1 allocate.go:43] Enter Allocate ...
I0722 20:42:56.021176 1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021192 1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0722 20:42:56.021201 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021210 1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0722 20:42:56.021217 1 priority.go:69] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job1> priority: 0
I0722 20:42:56.021224 1 gang.go:116] Gang JobOrderFn: <default/test-job> is ready: true, <default/test-job1> is ready: true
I0722 20:42:56.021230 1 drf.go:414] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job1> share state: 0.4166666666666667
I0722 20:42:56.021242 1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021278 1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0722 20:42:56.021285 1 priority.go:69] Priority JobOrderFn: <default/test-job2> priority: 0, <default/test-job> priority: 0
I0722 20:42:56.021291 1 gang.go:116] Gang JobOrderFn: <default/test-job2> is ready: true, <default/test-job> is ready: true
I0722 20:42:56.021321 1 drf.go:414] DRF JobOrderFn: <default/test-job2> share state: 0.08333333333333333, <default/test-job> share state: 0.4166666666666667
I0722 20:42:56.021334 1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021345 1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0722 20:42:56.021354 1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0722 20:42:56.021363 1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0722 20:42:56.021380 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0722 20:42:56.021394 1 allocate.go:192] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0722 20:42:56.021430 1 allocate.go:200] There are <1> nodes for Job <default/test-job-new>
I0722 20:42:56.021525 1 scheduler_helper.go:98] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0722 20:42:56.021588 1 scheduler_helper.go:103] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0722 20:42:56.021635 1 statement.go:353] Discarding operations ...
I0722 20:42:56.021648 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0722 20:42:56.021656 1 allocate.go:164] Can not find jobs for queue test1.
I0722 20:42:56.021665 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021670 1 priority.go:69] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0722 20:42:56.021677 1 gang.go:116] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0722 20:42:56.021682 1 drf.go:414] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0722 20:42:56.021699 1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0722 20:42:56.021731 1 statement.go:378] Committing operations ...
I0722 20:42:56.021745 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021779 1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job>
I0722 20:42:56.021787 1 statement.go:378] Committing operations ...
I0722 20:42:56.021794 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021804 1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0722 20:42:56.021807 1 statement.go:378] Committing operations ...
I0722 20:42:56.021814 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021823 1 allocate.go:164] Can not find jobs for queue test.
I0722 20:42:56.021839 1 allocate.go:154] Namespace <default> have no queue, skip it
I0722 20:42:56.021854 1 allocate.go:271] Leaving Allocate ...
I0722 20:42:56.021862 1 backfill.go:41] Enter Backfill ...
I0722 20:42:56.021869 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021882 1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021921 1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021936 1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021976 1 backfill.go:91] Leaving Backfill ...
I0722 20:42:56.021988 1 reclaim.go:41] Enter Reclaim ...
I0722 20:42:56.021995 1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0722 20:42:56.022003 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022029 1 reclaim.go:67] Added Queue <test> for Job <default/test-job>
I0722 20:42:56.022055 1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022095 1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022109 1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0722 20:42:56.022120 1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022150 1 reclaim.go:124] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0722 20:42:56.022187 1 gang.go:93] Can not preempt task <default/test-job2-bash-0> because of gang-scheduling
I0722 20:42:56.022193 1 gang.go:93] Can not preempt task <default/test-job1-bash-0> because of gang-scheduling
I0722 20:42:56.022197 1 gang.go:93] Can not preempt task <default/test-job-bash-0> because of gang-scheduling
I0722 20:42:56.022202 1 gang.go:100] Victims from Gang plugins are []
I0722 20:42:56.022211 1 proportion.go:236] Victims from proportion plugins are []
I0722 20:42:56.022225 1 reclaim.go:148] No validated victims on Node <docker-desktop>: no victims
I0722 20:42:56.022242 1 reclaim.go:189] Leaving Reclaim ...
I0722 20:42:56.022478 1 cache.go:645] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0722 20:42:56.022520 1 session.go:170] Close Session 68f9ce6a-7517-4b0e-a8a5-ab3c1f4ff804
I0722 20:42:56.022530 1 scheduler.go:110] End scheduling ...
Please note these lines from the logs above,
I0722 20:42:56.022187 1 gang.go:93] Can not preempt task <default/test-job2-bash-0> because of gang-scheduling
I0722 20:42:56.022193 1 gang.go:93] Can not preempt task <default/test-job1-bash-0> because of gang-scheduling
I0722 20:42:56.022197 1 gang.go:93] Can not preempt task <default/test-job-bash-0> because of gang-scheduling
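The proportion.go lines in the log above also show how the deserved values are reached. The sketch below is a simplified model of a weight-proportional split capped by each queue's request, written to mirror the logged numbers. It is not Volcano's implementation, and the function names deserved_shares and share are made up for this example. The inputs (12000 millicores in total, requests of 11000 and 500) are taken from the log.

```python
def deserved_shares(total_cpu, queues):
    """Split total_cpu (millicores) by weight, capped by each queue's request.

    queues maps a queue name to {"weight": int, "request": float}.
    Returns (deserved per queue, CPU left undistributed).
    """
    deserved = {name: 0.0 for name in queues}
    unmet = set(queues)  # queues whose request is not yet covered
    remaining = float(total_cpu)
    while unmet and remaining > 1e-9:
        total_weight = sum(queues[name]["weight"] for name in unmet)
        if total_weight == 0:
            break
        handed_out = 0.0
        for name in sorted(unmet):
            queue = queues[name]
            offer = deserved[name] + remaining * queue["weight"] / total_weight
            capped = min(offer, queue["request"])
            handed_out += capped - deserved[name]
            deserved[name] = capped
            if capped >= queue["request"]:
                unmet.discard(name)  # the log reports this as "queue <...> is meet"
        if handed_out <= 0:
            break
        remaining -= handed_out
    return deserved, remaining


def share(allocated, deserved):
    """Ratio printed as share <...> in the log; above 1.0 means overused."""
    return allocated / deserved if deserved else 0.0


queues = {
    "test": {"weight": 5, "request": 11000.0},
    "test1": {"weight": 5, "request": 500.0},
}
deserved, left = deserved_shares(12000, queues)
print(deserved, left)
print("share of test:", share(11000.0, deserved["test"]))
```

With the logged inputs this model arrives at a deserved value of 11000 for test and 500 for test1 with 500 left over, the same figures as in the log. In that run the allocation of test equals its deserved amount, so its share is 1.00 rather than above 1. Rerunning the model with a request of 2000 for test1 gives test a deserved value of 10000, which agrees with the deserved <cpu 10000.00> and share <1.1> lines in the first log.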
@Sharathmk99 Thank you for appending the scheduler log. Please confirm whether a PriorityClass is set for the Volcano jobs. From the code logic, if the preemptor's priority is higher than the preemptee's, the gang plugin will reject the job preemption.
https://github.com/volcano-sh/volcano/blob/44ec8eb28df26fbacd03ab15edabfc5916900c25/pkg/scheduler/plugins/gang/gang.go#L90-L94
@wpeng102 priorityClass is same for all volcano jobs.
But the test queue is using more resources than it deserves, so a job from the test queue should get evicted, right?
@Sharathmk99 Thanks for reporting this issue, https://github.com/volcano-sh/volcano/issues/1642 should share the same root cause with this one.
As a workaround, maybe you can try exchanging the positions of gang and proportion:
actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: proportion
  - name: conformance
- plugins:
  - name: gang
  - name: drf
  - name: predicates
  - name: nodeorder
  - name: binpack
@wpeng102 After exchanging the gang and proportion positions, still no luck,
Config,
actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: proportion
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: gang
  - name: nodeorder
  - name: binpack
I0726 09:19:31.508938 1 session.go:151] Open Session d8e6b335-8c84-4d2a-928a-aeb658b8f925 with <4> Job and <3> Queues
I0726 09:19:31.509300 1 binpack.go:158] Enter binpack plugin ...
I0726 09:19:31.509319 1 binpack.go:177] resources [] record in weight but not found on any node
I0726 09:19:31.509325 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0726 09:19:31.509332 1 proportion.go:75] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0726 09:19:31.509341 1 proportion.go:79] Considering Job <default/test-job-new>.
I0726 09:19:31.509345 1 proportion.go:97] Added Queue <test1> attributes.
I0726 09:19:31.509349 1 proportion.go:79] Considering Job <default/test-job>.
I0726 09:19:31.509352 1 proportion.go:97] Added Queue <test> attributes.
I0726 09:19:31.509354 1 proportion.go:79] Considering Job <default/test-job1>.
I0726 09:19:31.509357 1 proportion.go:79] Considering Job <default/test-job2>.
I0726 09:19:31.509367 1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <10>.
I0726 09:19:31.509373 1 proportion.go:172] queue <test1> is meet
I0726 09:19:31.509377 1 proportion.go:179] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0726 09:19:31.509384 1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <10>.
I0726 09:19:31.509390 1 proportion.go:175] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0726 09:19:31.509395 1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0726 09:19:31.509406 1 proportion.go:191] Remaining resource is <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0726 09:19:31.509453 1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <5>.
I0726 09:19:31.509458 1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <5>.
I0726 09:19:31.509485 1 proportion.go:172] queue <test> is meet
I0726 09:19:31.509490 1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0726 09:19:31.509500 1 proportion.go:191] Remaining resource is <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0726 09:19:31.509525 1 proportion.go:144] Exiting when total weight is 0
I0726 09:19:31.509533 1 drf.go:207] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0726 09:19:31.509549 1 enqueue.go:44] Enter Enqueue ...
I0726 09:19:31.509554 1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0726 09:19:31.509558 1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0726 09:19:31.509563 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0726 09:19:31.509567 1 enqueue.go:102] Leaving Enqueue ...
I0726 09:19:31.509575 1 allocate.go:43] Enter Allocate ...
I0726 09:19:31.509580 1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509605 1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0726 09:19:31.509611 1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509616 1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0726 09:19:31.509618 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509622 1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0726 09:19:31.509625 1 priority.go:69] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0726 09:19:31.509629 1 drf.go:414] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0726 09:19:31.509634 1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509641 1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0726 09:19:31.509643 1 priority.go:69] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0726 09:19:31.509647 1 drf.go:414] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0726 09:19:31.509667 1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0726 09:19:31.509672 1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0726 09:19:31.509710 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0726 09:19:31.509720 1 allocate.go:192] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0726 09:19:31.509764 1 allocate.go:200] There are <1> nodes for Job <default/test-job-new>
I0726 09:19:31.509797 1 scheduler_helper.go:98] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0726 09:19:31.509832 1 scheduler_helper.go:103] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0726 09:19:31.509877 1 statement.go:353] Discarding operations ...
I0726 09:19:31.509927 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0726 09:19:31.509934 1 allocate.go:164] Can not find jobs for queue test1.
I0726 09:19:31.509939 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.509942 1 priority.go:69] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job1> priority: 0
I0726 09:19:31.509945 1 drf.go:414] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job1> share state: 0.4166666666666667
I0726 09:19:31.509949 1 gang.go:116] Gang JobOrderFn: <default/test-job> is ready: true, <default/test-job1> is ready: true
I0726 09:19:31.509955 1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0726 09:19:31.509959 1 statement.go:378] Committing operations ...
I0726 09:19:31.509963 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.509968 1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job>
I0726 09:19:31.509972 1 statement.go:378] Committing operations ...
I0726 09:19:31.509979 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.510001 1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0726 09:19:31.510005 1 statement.go:378] Committing operations ...
I0726 09:19:31.510010 1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.510015 1 allocate.go:164] Can not find jobs for queue test.
I0726 09:19:31.510040 1 allocate.go:154] Namespace <default> have no queue, skip it
I0726 09:19:31.510047 1 allocate.go:271] Leaving Allocate ...
I0726 09:19:31.510072 1 backfill.go:41] Enter Backfill ...
I0726 09:19:31.510075 1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510108 1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510112 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510116 1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510120 1 backfill.go:91] Leaving Backfill ...
I0726 09:19:31.510123 1 reclaim.go:41] Enter Reclaim ...
I0726 09:19:31.510126 1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0726 09:19:31.510128 1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510132 1 reclaim.go:67] Added Queue <test> for Job <default/test-job1>
I0726 09:19:31.510138 1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510142 1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510181 1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0726 09:19:31.510185 1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510216 1 reclaim.go:124] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0726 09:19:31.510246 1 proportion.go:236] Victims from proportion plugins are []
I0726 09:19:31.510250 1 gang.go:93] Can not preempt task <default/test-job2-bash-0> because of gang-scheduling
I0726 09:19:31.510254 1 gang.go:93] Can not preempt task <default/test-job1-bash-0> because of gang-scheduling
I0726 09:19:31.510259 1 gang.go:93] Can not preempt task <default/test-job-bash-0> because of gang-scheduling
I0726 09:19:31.510263 1 gang.go:100] Victims from Gang plugins are []
I0726 09:19:31.510273 1 reclaim.go:148] No validated victims on Node <docker-desktop>: no victims
I0726 09:19:31.510348 1 reclaim.go:189] Leaving Reclaim ...
I0726 09:19:31.510507 1 cache.go:645] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0726 09:19:31.510641 1 session.go:170] Close Session d8e6b335-8c84-4d2a-928a-aeb658b8f925
I0726 09:19:31.510691 1 scheduler.go:110] End scheduling ...
@wpeng102 I did restart the deployment after the configmap changes,
kubectl rollout restart deployment -n volcano-system volcano-scheduler
@wpeng102 I tried building the Docker image from the master branch, but the issue is still the same: the scheduler is not able to reclaim the guaranteed resources for the second queue. Is it possible to solve the above use case with Volcano?
Logs,
I0806 22:58:42.564073 1 scheduler.go:91] Start scheduling ...
I0806 22:58:42.564206 1 cache.go:840] The priority of job <default/test-job-new> is <high-pri/0>
I0806 22:58:42.564223 1 cache.go:840] The priority of job <default/test-job2> is </0>
I0806 22:58:42.564224 1 cache.go:840] The priority of job <default/test-job1> is </0>
I0806 22:58:42.564254 1 cache.go:840] The priority of job <default/test-job> is </0>
I0806 22:58:42.564305 1 cache.go:878] There are <4> Jobs, <3> Queues and <1> Nodes in total for scheduling.
I0806 22:58:42.564329 1 session.go:165] Open Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f with <4> Job and <3> Queues
I0806 22:58:42.564345 1 proportion.go:73] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564357 1 proportion.go:77] Considering Job <default/test-job2>.
I0806 22:58:42.564362 1 proportion.go:95] Added Queue <test> attributes.
I0806 22:58:42.564366 1 proportion.go:77] Considering Job <default/test-job1>.
I0806 22:58:42.564368 1 proportion.go:77] Considering Job <default/test-job>.
I0806 22:58:42.564371 1 proportion.go:77] Considering Job <default/test-job-new>.
I0806 22:58:42.564374 1 proportion.go:95] Added Queue <test1> attributes.
I0806 22:58:42.564385 1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <10>.
I0806 22:58:42.564391 1 proportion.go:173] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0806 22:58:42.564398 1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0806 22:58:42.564406 1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <10>.
I0806 22:58:42.564414 1 proportion.go:170] queue <test1> is meet
I0806 22:58:42.564418 1 proportion.go:177] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0806 22:58:42.564442 1 proportion.go:189] Remaining resource is <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564457 1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <5>.
I0806 22:58:42.564467 1 proportion.go:170] queue <test> is meet
I0806 22:58:42.564476 1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0806 22:58:42.564484 1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <5>.
I0806 22:58:42.564490 1 proportion.go:189] Remaining resource is <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564514 1 proportion.go:142] Exiting when total weight is 0
I0806 22:58:42.564524 1 drf.go:206] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0806 22:58:42.564794 1 binpack.go:158] Enter binpack plugin ...
I0806 22:58:42.564814 1 binpack.go:177] resources [] record in weight but not found on any node
I0806 22:58:42.564820 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0806 22:58:42.564826 1 enqueue.go:44] Enter Enqueue ...
I0806 22:58:42.564830 1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.564834 1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.564838 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0806 22:58:42.564842 1 enqueue.go:103] Leaving Enqueue ...
I0806 22:58:42.564846 1 allocate.go:43] Enter Allocate ...
I0806 22:58:42.564851 1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.564859 1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0806 22:58:42.564862 1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564866 1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0806 22:58:42.564869 1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564872 1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0806 22:58:42.564875 1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564878 1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564882 1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564906 1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0806 22:58:42.564916 1 priority.go:70] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564934 1 drf.go:413] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564955 1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0806 22:58:42.564958 1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0806 22:58:42.564987 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0806 22:58:42.564994 1 allocate.go:196] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0806 22:58:42.564998 1 allocate.go:204] There are <1> nodes for Job <default/test-job-new>
I0806 22:58:42.565063 1 scheduler_helper.go:97] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0806 22:58:42.565084 1 scheduler_helper.go:102] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0806 22:58:42.565107 1 statement.go:351] Discarding operations ...
I0806 22:58:42.565115 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565119 1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0806 22:58:42.565121 1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0806 22:58:42.565125 1 gang.go:118] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0806 22:58:42.565131 1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0806 22:58:42.565135 1 statement.go:376] Committing operations ...
I0806 22:58:42.565142 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565148 1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job>
I0806 22:58:42.565150 1 statement.go:376] Committing operations ...
I0806 22:58:42.565182 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565226 1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0806 22:58:42.565232 1 statement.go:376] Committing operations ...
I0806 22:58:42.565248 1 allocate.go:158] Namespace <default> have no queue, skip it
I0806 22:58:42.565272 1 allocate.go:275] Leaving Allocate ...
I0806 22:58:42.565278 1 backfill.go:41] Enter Backfill ...
I0806 22:58:42.565281 1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565287 1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565294 1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565319 1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565348 1 backfill.go:90] Leaving Backfill ...
I0806 22:58:42.565352 1 reclaim.go:41] Enter Reclaim ...
I0806 22:58:42.565355 1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0806 22:58:42.565359 1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565364 1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.565367 1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565372 1 reclaim.go:67] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.565375 1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565379 1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565393 1 reclaim.go:121] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0806 22:58:42.565429 1 proportion.go:234] Victims from proportion plugins are []
I0806 22:58:42.565434 1 gang.go:97] Can not preempt task <default/test-job1-bash-0> because job test-job1 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565437 1 gang.go:97] Can not preempt task <default/test-job2-bash-0> because job test-job2 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565460 1 gang.go:97] Can not preempt task <default/test-job-bash-0> because job test-job ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565463 1 gang.go:102] Victims from Gang plugins are []
I0806 22:58:42.565470 1 reclaim.go:145] No validated victims on Node <docker-desktop>: no victims
I0806 22:58:42.565479 1 reclaim.go:189] Leaving Reclaim ...
I0806 22:58:42.565582 1 cache.go:730] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0806 22:58:42.565653 1 session.go:187] Close Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f
I0806 22:58:42.565661 1 scheduler.go:110] End scheduling ...
@Thor-wl @wpeng102 Can you help me to make fair sharing work? Is my Kubernetes version 1.21 a problem?
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
I0726 09:19:31.510263 1 gang.go:100] Victims from Gang plugins are []
Hi @Sharathmk99, for Volcano v1.3.0, the most important line I found in the pasted log was:
I0726 09:19:31.510246 1 proportion.go:236] Victims from proportion plugins are []
The related logic is:
for _, reclaimee := range reclaimees {
	...
	allocated.Sub(reclaimee.Resreq)
	if attr.deserved.LessEqualStrict(allocated) {
		victims = append(victims, reclaimee)
	}
}
klog.V(4).Infof("Victims from proportion plugins are %+v", victims)
For your queue test, I found this log:
Queue <test>: deserved <cpu 10000.00, memory 0.00, hugepages-2Mi 0.00>, allocated <cpu 11000.00, memory 0.00>, share <1.1>
So the situation may be that the reclaimee's request is subtracted from allocated first, which leaves deserved > allocated and eventually causes the victim calculation to be skipped. This was a bug in v1.3.0; the related fix is https://github.com/volcano-sh/volcano/pull/1540
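The effect described above can be illustrated with a minimal, CPU-only sketch. It is not the Volcano source: the function name `victimsSubtractFirst`, the millicore integers, and the plain `<=` comparison standing in for `LessEqualStrict` are all assumptions made for illustration. It uses the deserved and allocated values from the log line above and the job sizes from the issue description (5, 5 and 1 CPU).

```go
package main

import "fmt"

// victimsSubtractFirst is a hypothetical, CPU-only model of the loop quoted
// above: each reclaimee's request is subtracted from the queue's allocation
// first, and only then is the result compared with the queue's deserved share.
// All values are in millicores; a plain <= stands in for LessEqualStrict.
func victimsSubtractFirst(deserved, allocated int64, requests []int64) []int64 {
	var victims []int64
	for _, req := range requests {
		allocated -= req
		if deserved <= allocated {
			victims = append(victims, req)
		}
	}
	return victims
}

func main() {
	// deserved 10000 and allocated 11000 come from the log line above;
	// the request sizes come from the issue description.
	v := victimsSubtractFirst(10000, 11000, []int64{5000, 5000, 1000})
	fmt.Println(len(v)) // prints 0: no victims are selected
}
```

In this simplified model the outcome depends on the order in which reclaimees are visited, so it only illustrates the mechanism and does not claim to reproduce the scheduler's exact behaviour.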
@shinytang6 Thank you for the response. I tried building a Docker image from the master branch and testing it, but it still doesn't work.
Ideally the test queue should deserve 5000, test1 should deserve 5000, and the default queue should deserve 1000. Do I need to share any other logs?
Below is the log from the master branch Docker image:
I0806 22:58:42.564073 1 scheduler.go:91] Start scheduling ...
I0806 22:58:42.564206 1 cache.go:840] The priority of job <default/test-job-new> is <high-pri/0>
I0806 22:58:42.564223 1 cache.go:840] The priority of job <default/test-job2> is </0>
I0806 22:58:42.564224 1 cache.go:840] The priority of job <default/test-job1> is </0>
I0806 22:58:42.564254 1 cache.go:840] The priority of job <default/test-job> is </0>
I0806 22:58:42.564305 1 cache.go:878] There are <4> Jobs, <3> Queues and <1> Nodes in total for scheduling.
I0806 22:58:42.564329 1 session.go:165] Open Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f with <4> Job and <3> Queues
I0806 22:58:42.564345 1 proportion.go:73] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564357 1 proportion.go:77] Considering Job <default/test-job2>.
I0806 22:58:42.564362 1 proportion.go:95] Added Queue <test> attributes.
I0806 22:58:42.564366 1 proportion.go:77] Considering Job <default/test-job1>.
I0806 22:58:42.564368 1 proportion.go:77] Considering Job <default/test-job>.
I0806 22:58:42.564371 1 proportion.go:77] Considering Job <default/test-job-new>.
I0806 22:58:42.564374 1 proportion.go:95] Added Queue <test1> attributes.
I0806 22:58:42.564385 1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <10>.
I0806 22:58:42.564391 1 proportion.go:173] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0806 22:58:42.564398 1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0806 22:58:42.564406 1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <10>.
I0806 22:58:42.564414 1 proportion.go:170] queue <test1> is meet
I0806 22:58:42.564418 1 proportion.go:177] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0806 22:58:42.564442 1 proportion.go:189] Remaining resource is <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564457 1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <5>.
I0806 22:58:42.564467 1 proportion.go:170] queue <test> is meet
I0806 22:58:42.564476 1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0806 22:58:42.564484 1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <5>.
I0806 22:58:42.564490 1 proportion.go:189] Remaining resource is <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564514 1 proportion.go:142] Exiting when total weight is 0
I0806 22:58:42.564524 1 drf.go:206] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0806 22:58:42.564794 1 binpack.go:158] Enter binpack plugin ...
I0806 22:58:42.564814 1 binpack.go:177] resources [] record in weight but not found on any node
I0806 22:58:42.564820 1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0806 22:58:42.564826 1 enqueue.go:44] Enter Enqueue ...
I0806 22:58:42.564830 1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.564834 1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.564838 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0806 22:58:42.564842 1 enqueue.go:103] Leaving Enqueue ...
I0806 22:58:42.564846 1 allocate.go:43] Enter Allocate ...
I0806 22:58:42.564851 1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.564859 1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0806 22:58:42.564862 1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564866 1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0806 22:58:42.564869 1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564872 1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0806 22:58:42.564875 1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564878 1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564882 1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564906 1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0806 22:58:42.564916 1 priority.go:70] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564934 1 drf.go:413] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564955 1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0806 22:58:42.564958 1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0806 22:58:42.564987 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0806 22:58:42.564994 1 allocate.go:196] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0806 22:58:42.564998 1 allocate.go:204] There are <1> nodes for Job <default/test-job-new>
I0806 22:58:42.565063 1 scheduler_helper.go:97] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0806 22:58:42.565084 1 scheduler_helper.go:102] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0806 22:58:42.565107 1 statement.go:351] Discarding operations ...
I0806 22:58:42.565115 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565119 1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0806 22:58:42.565121 1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0806 22:58:42.565125 1 gang.go:118] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0806 22:58:42.565131 1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0806 22:58:42.565135 1 statement.go:376] Committing operations ...
I0806 22:58:42.565142 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565148 1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job>
I0806 22:58:42.565150 1 statement.go:376] Committing operations ...
I0806 22:58:42.565182 1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565226 1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0806 22:58:42.565232 1 statement.go:376] Committing operations ...
I0806 22:58:42.565248 1 allocate.go:158] Namespace <default> have no queue, skip it
I0806 22:58:42.565272 1 allocate.go:275] Leaving Allocate ...
I0806 22:58:42.565278 1 backfill.go:41] Enter Backfill ...
I0806 22:58:42.565281 1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565287 1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565294 1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565319 1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565348 1 backfill.go:90] Leaving Backfill ...
I0806 22:58:42.565352 1 reclaim.go:41] Enter Reclaim ...
I0806 22:58:42.565355 1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0806 22:58:42.565359 1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565364 1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.565367 1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565372 1 reclaim.go:67] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.565375 1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565379 1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565393 1 reclaim.go:121] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0806 22:58:42.565429 1 proportion.go:234] Victims from proportion plugins are []
I0806 22:58:42.565434 1 gang.go:97] Can not preempt task <default/test-job1-bash-0> because job test-job1 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565437 1 gang.go:97] Can not preempt task <default/test-job2-bash-0> because job test-job2 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565460 1 gang.go:97] Can not preempt task <default/test-job-bash-0> because job test-job ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565463 1 gang.go:102] Victims from Gang plugins are []
I0806 22:58:42.565470 1 reclaim.go:145] No validated victims on Node <docker-desktop>: no victims
I0806 22:58:42.565479 1 reclaim.go:189] Leaving Reclaim ...
I0806 22:58:42.565582 1 cache.go:730] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0806 22:58:42.565653 1 session.go:187] Close Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f
I0806 22:58:42.565661 1 scheduler.go:110] End scheduling ...
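The deserved values in the log above (6000 and then 11000 for test, 500 for test1, with 5500 and then 500 remaining) are consistent with a weighted share that is capped at each queue's request and redistributed in rounds. The following is a minimal CPU-only sketch under that assumption. It is not the proportion plugin's code; the `queue` type and the `deserve` function are hypothetical names. Only the two queues that have jobs are modelled, because the log reports total weight <10> rather than 11.

```go
package main

import "fmt"

// queue is a hypothetical, CPU-only stand-in for a queue's attributes.
// All resource values are in millicores.
type queue struct {
	name     string
	weight   int64
	request  int64
	deserved int64
	met      bool
}

// deserve splits total among the queues by weight, caps each queue at its
// request, and redistributes the remainder among the queues that are not yet
// met. It returns the amount left over when no queue can take more.
func deserve(total int64, queues []*queue) int64 {
	remaining := total
	for {
		var totalWeight int64
		for _, q := range queues {
			if !q.met {
				totalWeight += q.weight
			}
		}
		if totalWeight == 0 || remaining == 0 {
			return remaining
		}
		var granted int64
		for _, q := range queues {
			if q.met {
				continue
			}
			old := q.deserved
			q.deserved += remaining * q.weight / totalWeight
			if q.deserved >= q.request {
				q.deserved = q.request // cap at the request
				q.met = true
			}
			granted += q.deserved - old
		}
		if granted == 0 {
			return remaining // nothing could be handed out; avoid looping forever
		}
		remaining -= granted
	}
}

func main() {
	// Total, weights and requests are taken from the log above.
	qs := []*queue{
		{name: "test", weight: 5, request: 11000},
		{name: "test1", weight: 5, request: 500},
	}
	rem := deserve(12000, qs)
	for _, q := range qs {
		fmt.Println(q.name, q.deserved)
	}
	fmt.Println("remaining", rem)
}
```

Under this model test ends up deserving exactly what it already has allocated (11000), so a check against deserved finds nothing to reclaim; an even split would only appear if test1 requested at least its weighted share.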
I will take a look at that; my intuition is that there are still some potential bugs in the proportion plugin.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗