Fair sharing not working

Open Sharathmk99 opened this issue 2 years ago • 24 comments

What happened: My cluster has 11 CPUs in total. I created 2 queues (excluding the default queue), each with weight 5. Queue manifests,

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  weight: 5

---

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test1
spec:
  weight: 5

Queue List,

Name                     Weight  State   Inqueue Pending Running Unknown
default                  1       Open    0       0       0       0
test                     5       Open    0       0       0       0
test1                    5       Open    0       0       0       0

Created 3 Jobs in the test queue with the following CPU requests:

  • job1 -> CPU 5
  • job2 -> CPU 5
  • job3 -> CPU 1

All 3 jobs are now running and using the full cluster.

Next, I created a new Job in the test1 queue with CPU 2. I expected 1 Job to be evicted from the test queue so that the Job in the test1 queue could run, but the Job in the test1 queue stays in the Inqueue state.

Name                     Weight  State   Inqueue Pending Running Unknown
default                  1       Open    0       0       0       0
test                     5       Open    0       0       3       0
test1                    5       Open    1       0       0       0

Configuration,

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

What you expected to happen: I expected 1 Job to be evicted from the test queue so that the Job in the test1 queue could run, but the Job in the test1 queue stays in the Inqueue state.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: v1.3.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Sharathmk99 avatar Jul 14 '21 21:07 Sharathmk99

Can you help with that? @renhuanyu

Thor-wl avatar Jul 15 '21 01:07 Thor-wl

@Sharathmk99 please add the reclaim action to the configuration and try again.

actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

wpeng102 avatar Jul 15 '21 01:07 wpeng102

actions should include reclaim if you want to use the fair-sharing feature.

lowang-bh avatar Jul 15 '21 03:07 lowang-bh

@wpeng102 @lowang-bh I added reclaim to the actions. Still the same issue,

Name                     Weight  State   Inqueue Pending Running Unknown
default                  1       Open    0       0       0       0
test                     5       Open    0       0       3       0
test1                    5       Open    1       0       0       0

Config,

actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

Restarted all 3 pods, kubectl rollout restart deployment -n volcano-system

Do I need to share any other details?

Sharathmk99 avatar Jul 15 '21 07:07 Sharathmk99

@lowang-bh @wpeng102 Is my Job manifest correct?

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-job
  namespace: default
spec:
  schedulerName: volcano
  policies:
    - event: TaskCompleted
      action: CompleteJob
  queue: test1
  tasks:
  - replicas: 1
    name: "bash"
    policies:
      - event: TaskCompleted
        action: CompleteJob
    template:
      metadata:
        labels:
          app: bash
      spec:
        containers:
          - name: bash
            image: bash
            command: ["bash", "-c", "echo 'sleep...'; sleep 300"]
            resources:
              requests:
                cpu: "2"
        restartPolicy: Never

Sharathmk99 avatar Jul 15 '21 07:07 Sharathmk99

@wpeng102 @lowang-bh

Description of PodGroup,

kubectl describe podgroup -n default test-job

Name:         test-job
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2021-07-15T08:00:24Z
  Generation:          5
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:      vc-scheduler
    Operation:    Update
    Time:         2021-07-15T08:00:25Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:ownerReferences:
          .:
          k:{"uid":"dc958eed-ee1d-4d7c-a668-cc44df37bafb"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:cpu:
        f:minTaskMember:
          .:
          f:bash:
        f:queue:
      f:status:
        .:
        f:conditions:
        f:phase:
    Manager:    vc-controller-manager
    Operation:  Update
    Time:       2021-07-15T08:00:26Z
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  test-job
    UID:                   dc958eed-ee1d-4d7c-a668-cc44df37bafb
  Resource Version:        184342
  UID:                     805f905b-02d3-498e-9846-770caeeed38f
Spec:
  Min Member:  1
  Min Resources:
    Cpu:  2
  Min Task Member:
    Bash:  1
  Queue:   test1
Status:
  Conditions:
    Last Transition Time:  2021-07-15T08:00:26Z
    Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable.
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         444ca19c-bbed-4855-b75a-38d81995bd52
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Warning  Unschedulable  15s (x2 over 16s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable.
  Warning  Unschedulable  1s (x14 over 14s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable.

Pod describe events,

Events:
  Type     Reason            Age    From     Message
  ----     ------            ----   ----     -------
  Warning  FailedScheduling  3m19s  volcano  all nodes are unavailable: 1 node(s) resource fit failed.

Logs of Scheduler,

kubectl logs -n volcano-system volcano-scheduler-595c747db-bx2g9

I0715 07:53:38.463612       1 session.go:151] Open Session f887351a-bada-4666-97a7-43e1e22ad2a7 with <4> Job and <3> Queues
I0715 07:53:38.463907       1 enqueue.go:44] Enter Enqueue ...
I0715 07:53:38.463925       1 enqueue.go:62] Added Queue <test> for Job <test-ns/test-job>
I0715 07:53:38.463930       1 enqueue.go:62] Added Queue <test1> for Job <default/test-job>
I0715 07:53:38.463935       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0715 07:53:38.463938       1 enqueue.go:102] Leaving Enqueue ...
I0715 07:53:38.463943       1 allocate.go:43] Enter Allocate ...
I0715 07:53:38.463947       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463955       1 job_info.go:555] job test-job2/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463962       1 job_info.go:555] job test-job/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463967       1 job_info.go:555] job test-job1/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.463973       1 allocate.go:94] Try to allocate resource to 2 Namespaces
I0715 07:53:38.463980       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0715 07:53:38.463986       1 allocate.go:192] Try to allocate resource to 1 tasks of Job <default/test-job>
I0715 07:53:38.463990       1 allocate.go:200] There are <1> nodes for Job <default/test-job>
I0715 07:53:38.464027       1 scheduler_helper.go:103] Predicates failed for task <default/test-job-bash-0> on node <docker-desktop>: task default/test-job-bash-0 on node docker-desktop fit failed: node(s) resource fit failed
I0715 07:53:38.464057       1 statement.go:353] Discarding operations ...
I0715 07:53:38.464079       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0715 07:53:38.464090       1 allocate.go:154] Namespace <default> have no queue, skip it
I0715 07:53:38.464093       1 proportion.go:247] Queue <test>: deserved <cpu 10000.00, memory 0.00, hugepages-2Mi 0.00>, allocated <cpu 11000.00, memory 0.00>, share <1.1>
I0715 07:53:38.464104       1 allocate.go:143] Namespace <test-ns> Queue <test> is overused, ignore it.
I0715 07:53:38.464111       1 allocate.go:154] Namespace <test-ns> have no queue, skip it
I0715 07:53:38.464120       1 allocate.go:271] Leaving Allocate ...
I0715 07:53:38.464124       1 backfill.go:41] Enter Backfill ...
I0715 07:53:38.464128       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464133       1 job_info.go:555] job test-job2/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464138       1 job_info.go:555] job test-job/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464141       1 job_info.go:555] job test-job1/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464146       1 backfill.go:91] Leaving Backfill ...
I0715 07:53:38.464150       1 reclaim.go:41] Enter Reclaim ...
I0715 07:53:38.464154       1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0715 07:53:38.464157       1 job_info.go:555] job test-job2/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464162       1 job_info.go:555] job test-job/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464166       1 job_info.go:555] job test-job1/test-ns actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464171       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0715 07:53:38.464184       1 reclaim.go:124] Considering Task <default/test-job-bash-0> on Node <docker-desktop>.
I0715 07:53:38.464203       1 reclaim.go:148] No validated victims on Node <docker-desktop>: no victims
I0715 07:53:38.464207       1 proportion.go:247] Queue <test>: deserved <cpu 10000.00, memory 0.00, hugepages-2Mi 0.00>, allocated <cpu 11000.00, memory 0.00>, share <1.1>
I0715 07:53:38.464212       1 reclaim.go:95] Queue <test> is overused, ignore it.
I0715 07:53:38.464217       1 reclaim.go:189] Leaving Reclaim ...
I0715 07:53:38.464336       1 session.go:170] Close Session f887351a-bada-4666-97a7-43e1e22ad2a7

Sharathmk99 avatar Jul 15 '21 07:07 Sharathmk99

@wpeng102 @Thor-wl @renhuanyu could you please help me to figure out the issue?

Sharathmk99 avatar Jul 16 '21 04:07 Sharathmk99

reclaim works only when several conditions are all met; you can check them from AddReclaimableFn:

  1. gang plugin: preemptable := job.MinAvailable == 0 || job.MinAvailable <= job.ReadyTaskNum()-1
  2. conformance plugin: the evictor cannot reclaim pods in a system namespace or with a system priority
  3. drf plugin:
  4. proportion plugin: the victim's queue has more allocated than it deserves (that means it is overused)
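
Condition 1 can be evaluated by hand for the jobs in this thread. The sketch below is a direct Python transcription of the expression quoted in the list; the function name is hypothetical, and it assumes each running job has minMember 1 and one ready task, like the single-replica manifest and PodGroup shown earlier.

```python
def gang_preemptable(min_available: int, ready_task_num: int) -> bool:
    """Python rendering of the quoted check:
    job.MinAvailable == 0 || job.MinAvailable <= job.ReadyTaskNum()-1
    """
    return min_available == 0 or min_available <= ready_task_num - 1


# (min_available, ready_task_num) pairs; the first one is the assumed shape
# of the jobs in this thread: one replica, minMember 1.
cases = [(1, 1), (0, 1), (1, 2)]
results = {case: gang_preemptable(*case) for case in cases}
for case, ok in results.items():
    print(case, "preemptable" if ok else "protected by gang")
```

For (1, 1) the check is false, because evicting the only task would drop the job below its minimum. Under that assumption, a single-replica job with minMember 1 would never be offered as a victim by the gang plugin.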

lowang-bh avatar Jul 16 '21 06:07 lowang-bh

@Sharathmk99 could you change the scheduler log level to V=4 and post the scheduler log again? You can edit the scheduler deployment and set -v=4 in the container args.

https://github.com/volcano-sh/volcano/blob/44ec8eb28df26fbacd03ab15edabfc5916900c25/installer/volcano-development.yaml#L7592-L7595

wpeng102 avatar Jul 22 '21 07:07 wpeng102

@wpeng102 Please find the logs below,

I0722 20:42:56.019764       1 session.go:151] Open Session 68f9ce6a-7517-4b0e-a8a5-ab3c1f4ff804 with <4> Job and <3> Queues
I0722 20:42:56.020020       1 proportion.go:75] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0722 20:42:56.020075       1 proportion.go:79] Considering Job <default/test-job1>.
I0722 20:42:56.020088       1 proportion.go:97] Added Queue <test> attributes.
I0722 20:42:56.020099       1 proportion.go:79] Considering Job <default/test-job>.
I0722 20:42:56.020106       1 proportion.go:79] Considering Job <default/test-job2>.
I0722 20:42:56.020113       1 proportion.go:79] Considering Job <default/test-job-new>.
I0722 20:42:56.020121       1 proportion.go:97] Added Queue <test1> attributes.
I0722 20:42:56.020172       1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <10>.
I0722 20:42:56.020224       1 proportion.go:175] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0722 20:42:56.020273       1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0722 20:42:56.020353       1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <10>.
I0722 20:42:56.020386       1 proportion.go:172] queue <test1> is meet
I0722 20:42:56.020426       1 proportion.go:179] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0722 20:42:56.020452       1 proportion.go:191] Remaining resource is  <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0722 20:42:56.020474       1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <5>.
I0722 20:42:56.020488       1 proportion.go:172] queue <test> is meet
I0722 20:42:56.020515       1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0722 20:42:56.020532       1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <5>.
I0722 20:42:56.020577       1 proportion.go:191] Remaining resource is  <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0722 20:42:56.020598       1 proportion.go:144] Exiting when total weight is 0
I0722 20:42:56.020919       1 binpack.go:158] Enter binpack plugin ...
I0722 20:42:56.020957       1 binpack.go:177] resources [] record in weight but not found on any node
I0722 20:42:56.020969       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0722 20:42:56.020984       1 drf.go:207] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0722 20:42:56.021018       1 enqueue.go:44] Enter Enqueue ...
I0722 20:42:56.021055       1 enqueue.go:62] Added Queue <test> for Job <default/test-job>
I0722 20:42:56.021067       1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0722 20:42:56.021124       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0722 20:42:56.021155       1 enqueue.go:102] Leaving Enqueue ...
I0722 20:42:56.021167       1 allocate.go:43] Enter Allocate ...
I0722 20:42:56.021176       1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021192       1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0722 20:42:56.021201       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021210       1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0722 20:42:56.021217       1 priority.go:69] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job1> priority: 0
I0722 20:42:56.021224       1 gang.go:116] Gang JobOrderFn: <default/test-job> is ready: true, <default/test-job1> is ready: true
I0722 20:42:56.021230       1 drf.go:414] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job1> share state: 0.4166666666666667
I0722 20:42:56.021242       1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021278       1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0722 20:42:56.021285       1 priority.go:69] Priority JobOrderFn: <default/test-job2> priority: 0, <default/test-job> priority: 0
I0722 20:42:56.021291       1 gang.go:116] Gang JobOrderFn: <default/test-job2> is ready: true, <default/test-job> is ready: true
I0722 20:42:56.021321       1 drf.go:414] DRF JobOrderFn: <default/test-job2> share state: 0.08333333333333333, <default/test-job> share state: 0.4166666666666667
I0722 20:42:56.021334       1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021345       1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0722 20:42:56.021354       1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0722 20:42:56.021363       1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0722 20:42:56.021380       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0722 20:42:56.021394       1 allocate.go:192] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0722 20:42:56.021430       1 allocate.go:200] There are <1> nodes for Job <default/test-job-new>
I0722 20:42:56.021525       1 scheduler_helper.go:98] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0722 20:42:56.021588       1 scheduler_helper.go:103] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0722 20:42:56.021635       1 statement.go:353] Discarding operations ...
I0722 20:42:56.021648       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0722 20:42:56.021656       1 allocate.go:164] Can not find jobs for queue test1.
I0722 20:42:56.021665       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021670       1 priority.go:69] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0722 20:42:56.021677       1 gang.go:116] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0722 20:42:56.021682       1 drf.go:414] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0722 20:42:56.021699       1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0722 20:42:56.021731       1 statement.go:378] Committing operations ...
I0722 20:42:56.021745       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021779       1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job>
I0722 20:42:56.021787       1 statement.go:378] Committing operations ...
I0722 20:42:56.021794       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021804       1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0722 20:42:56.021807       1 statement.go:378] Committing operations ...
I0722 20:42:56.021814       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0722 20:42:56.021823       1 allocate.go:164] Can not find jobs for queue test.
I0722 20:42:56.021839       1 allocate.go:154] Namespace <default> have no queue, skip it
I0722 20:42:56.021854       1 allocate.go:271] Leaving Allocate ...
I0722 20:42:56.021862       1 backfill.go:41] Enter Backfill ...
I0722 20:42:56.021869       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021882       1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021921       1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021936       1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.021976       1 backfill.go:91] Leaving Backfill ...
I0722 20:42:56.021988       1 reclaim.go:41] Enter Reclaim ...
I0722 20:42:56.021995       1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0722 20:42:56.022003       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022029       1 reclaim.go:67] Added Queue <test> for Job <default/test-job>
I0722 20:42:56.022055       1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022095       1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022109       1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0722 20:42:56.022120       1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0722 20:42:56.022150       1 reclaim.go:124] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0722 20:42:56.022187       1 gang.go:93] Can not preempt task <default/test-job2-bash-0> because of gang-scheduling
I0722 20:42:56.022193       1 gang.go:93] Can not preempt task <default/test-job1-bash-0> because of gang-scheduling
I0722 20:42:56.022197       1 gang.go:93] Can not preempt task <default/test-job-bash-0> because of gang-scheduling
I0722 20:42:56.022202       1 gang.go:100] Victims from Gang plugins are []
I0722 20:42:56.022211       1 proportion.go:236] Victims from proportion plugins are []
I0722 20:42:56.022225       1 reclaim.go:148] No validated victims on Node <docker-desktop>: no victims
I0722 20:42:56.022242       1 reclaim.go:189] Leaving Reclaim ...
I0722 20:42:56.022478       1 cache.go:645] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0722 20:42:56.022520       1 session.go:170] Close Session 68f9ce6a-7517-4b0e-a8a5-ab3c1f4ff804
I0722 20:42:56.022530       1 scheduler.go:110] End scheduling ...

Please note the following lines from the logs above,

I0722 20:42:56.022187       1 gang.go:93] Can not preempt task <default/test-job2-bash-0> because of gang-scheduling
I0722 20:42:56.022193       1 gang.go:93] Can not preempt task <default/test-job1-bash-0> because of gang-scheduling
I0722 20:42:56.022197       1 gang.go:93] Can not preempt task <default/test-job-bash-0> because of gang-scheduling
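
The proportion.go lines in the logs above suggest an iterative weighted split: each round hands the remaining CPU to the queues that are not yet satisfied, in proportion to their weights, caps a queue at its own request, and repeats with what is left. The following is a minimal single-resource Python sketch of that reading, not Volcano's actual implementation. The function name `deserved_cpu` and the dictionary layout are made up for illustration; the inputs are the millicore values printed in the logs, and the 2000m request for the first run is assumed from the Job manifest.

```python
def deserved_cpu(total, queues):
    """Split `total` millicores among queues by weight, capped at each request.

    `queues` maps a queue name to (weight, requested_millicores).
    Hypothetical helper that models only CPU; it is not Volcano code.
    """
    deserved = {name: 0.0 for name in queues}
    met = set()                 # queues whose request is fully covered
    remaining = float(total)
    while remaining > 1e-9:
        active = [name for name in queues if name not in met]
        total_weight = sum(queues[name][0] for name in active)
        if total_weight == 0:   # every queue is satisfied
            break
        handed_out = 0.0
        for name in active:
            weight, request = queues[name]
            offer = deserved[name] + remaining * weight / total_weight
            if request <= offer:    # cap at the request and stop growing
                offer = request
                met.add(name)
            handed_out += offer - deserved[name]
            deserved[name] = offer
        remaining -= handed_out
    return deserved


ALLOCATED_TEST = 11000  # millicores allocated to queue test in both runs

# First run: new job assumed to request 2000m. Second run: 500m, per the -v=4 log.
first = deserved_cpu(12000, {"test": (5, 11000), "test1": (5, 2000)})
second = deserved_cpu(12000, {"test": (5, 11000), "test1": (5, 500)})

for label, result in (("2000m request", first), ("500m request", second)):
    share = ALLOCATED_TEST / result["test"]
    print(label, result, "share of test:", round(share, 2), "overused:", share > 1)
```

With these inputs the sketch gives queue test 10000m deserved in the first case (share 1.1, matching the first log) and 11000m in the second (share 1.00, matching the -v=4 log). Under this reading, once the new job asks for only 500m, queue test deserves everything it already holds, which would be consistent with "Victims from proportion plugins are []" in the -v=4 log.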

Sharathmk99 avatar Jul 22 '21 20:07 Sharathmk99

@Sharathmk99 Thank you for appending the scheduler log. Please confirm whether a PriorityClass is set for the Volcano job. From the code logic, if the preemptor's priority is higher than the preemptee's, the gang plugin will reject the job preemption.

https://github.com/volcano-sh/volcano/blob/44ec8eb28df26fbacd03ab15edabfc5916900c25/pkg/scheduler/plugins/gang/gang.go#L90-L94

wpeng102 avatar Jul 23 '21 00:07 wpeng102

@wpeng102 The priorityClass is the same for all Volcano jobs.

But the test queue is using more resources than it deserves, so a job from the test queue should get evicted, right?

Sharathmk99 avatar Jul 23 '21 07:07 Sharathmk99

@Sharathmk99 Thanks for reporting this issue. https://github.com/volcano-sh/volcano/issues/1642 should share the same root cause as this one.

As a workaround, you could try exchanging the positions of gang and proportion:

actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: proportion
  - name: conformance
- plugins:
  - name: gang 
  - name: drf
  - name: predicates
  - name: nodeorder
  - name: binpack

wpeng102 avatar Jul 26 '21 01:07 wpeng102

@wpeng102 After exchanging the positions of gang and proportion, still no luck,

Config,

actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: proportion
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: gang
  - name: nodeorder
  - name: binpack

Logs,

I0726 09:19:31.508938       1 session.go:151] Open Session d8e6b335-8c84-4d2a-928a-aeb658b8f925 with <4> Job and <3> Queues
I0726 09:19:31.509300       1 binpack.go:158] Enter binpack plugin ...
I0726 09:19:31.509319       1 binpack.go:177] resources [] record in weight but not found on any node
I0726 09:19:31.509325       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0726 09:19:31.509332       1 proportion.go:75] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0726 09:19:31.509341       1 proportion.go:79] Considering Job <default/test-job-new>.
I0726 09:19:31.509345       1 proportion.go:97] Added Queue <test1> attributes.
I0726 09:19:31.509349       1 proportion.go:79] Considering Job <default/test-job>.
I0726 09:19:31.509352       1 proportion.go:97] Added Queue <test> attributes.
I0726 09:19:31.509354       1 proportion.go:79] Considering Job <default/test-job1>.
I0726 09:19:31.509357       1 proportion.go:79] Considering Job <default/test-job2>.
I0726 09:19:31.509367       1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <10>.
I0726 09:19:31.509373       1 proportion.go:172] queue <test1> is meet
I0726 09:19:31.509377       1 proportion.go:179] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0726 09:19:31.509384       1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <10>.
I0726 09:19:31.509390       1 proportion.go:175] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0726 09:19:31.509395       1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0726 09:19:31.509406       1 proportion.go:191] Remaining resource is  <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0726 09:19:31.509453       1 proportion.go:155] Considering Queue <test1>: weight <5>, total weight <5>.
I0726 09:19:31.509458       1 proportion.go:155] Considering Queue <test>: weight <5>, total weight <5>.
I0726 09:19:31.509485       1 proportion.go:172] queue <test> is meet
I0726 09:19:31.509490       1 proportion.go:179] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0726 09:19:31.509500       1 proportion.go:191] Remaining resource is  <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0726 09:19:31.509525       1 proportion.go:144] Exiting when total weight is 0
I0726 09:19:31.509533       1 drf.go:207] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0726 09:19:31.509549       1 enqueue.go:44] Enter Enqueue ...
I0726 09:19:31.509554       1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0726 09:19:31.509558       1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0726 09:19:31.509563       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0726 09:19:31.509567       1 enqueue.go:102] Leaving Enqueue ...
I0726 09:19:31.509575       1 allocate.go:43] Enter Allocate ...
I0726 09:19:31.509580       1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509605       1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0726 09:19:31.509611       1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509616       1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0726 09:19:31.509618       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509622       1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0726 09:19:31.509625       1 priority.go:69] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0726 09:19:31.509629       1 drf.go:414] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0726 09:19:31.509634       1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.509641       1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0726 09:19:31.509643       1 priority.go:69] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0726 09:19:31.509647       1 drf.go:414] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0726 09:19:31.509667       1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0726 09:19:31.509672       1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0726 09:19:31.509710       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0726 09:19:31.509720       1 allocate.go:192] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0726 09:19:31.509764       1 allocate.go:200] There are <1> nodes for Job <default/test-job-new>
I0726 09:19:31.509797       1 scheduler_helper.go:98] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0726 09:19:31.509832       1 scheduler_helper.go:103] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0726 09:19:31.509877       1 statement.go:353] Discarding operations ...
I0726 09:19:31.509927       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0726 09:19:31.509934       1 allocate.go:164] Can not find jobs for queue test1.
I0726 09:19:31.509939       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.509942       1 priority.go:69] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job1> priority: 0
I0726 09:19:31.509945       1 drf.go:414] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job1> share state: 0.4166666666666667
I0726 09:19:31.509949       1 gang.go:116] Gang JobOrderFn: <default/test-job> is ready: true, <default/test-job1> is ready: true
I0726 09:19:31.509955       1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0726 09:19:31.509959       1 statement.go:378] Committing operations ...
I0726 09:19:31.509963       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.509968       1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job>
I0726 09:19:31.509972       1 statement.go:378] Committing operations ...
I0726 09:19:31.509979       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.510001       1 allocate.go:192] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0726 09:19:31.510005       1 statement.go:378] Committing operations ...
I0726 09:19:31.510010       1 allocate.go:158] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0726 09:19:31.510015       1 allocate.go:164] Can not find jobs for queue test.
I0726 09:19:31.510040       1 allocate.go:154] Namespace <default> have no queue, skip it
I0726 09:19:31.510047       1 allocate.go:271] Leaving Allocate ...
I0726 09:19:31.510072       1 backfill.go:41] Enter Backfill ...
I0726 09:19:31.510075       1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510108       1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510112       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510116       1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510120       1 backfill.go:91] Leaving Backfill ...
I0726 09:19:31.510123       1 reclaim.go:41] Enter Reclaim ...
I0726 09:19:31.510126       1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0726 09:19:31.510128       1 job_info.go:555] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510132       1 reclaim.go:67] Added Queue <test> for Job <default/test-job1>
I0726 09:19:31.510138       1 job_info.go:555] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510142       1 job_info.go:555] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510181       1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0726 09:19:31.510185       1 job_info.go:555] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[]
I0726 09:19:31.510216       1 reclaim.go:124] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0726 09:19:31.510246       1 proportion.go:236] Victims from proportion plugins are []
I0726 09:19:31.510250       1 gang.go:93] Can not preempt task <default/test-job2-bash-0> because of gang-scheduling
I0726 09:19:31.510254       1 gang.go:93] Can not preempt task <default/test-job1-bash-0> because of gang-scheduling
I0726 09:19:31.510259       1 gang.go:93] Can not preempt task <default/test-job-bash-0> because of gang-scheduling
I0726 09:19:31.510263       1 gang.go:100] Victims from Gang plugins are []
I0726 09:19:31.510273       1 reclaim.go:148] No validated victims on Node <docker-desktop>: no victims
I0726 09:19:31.510348       1 reclaim.go:189] Leaving Reclaim ...
I0726 09:19:31.510507       1 cache.go:645] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0726 09:19:31.510641       1 session.go:170] Close Session d8e6b335-8c84-4d2a-928a-aeb658b8f925
I0726 09:19:31.510691       1 scheduler.go:110] End scheduling ...

Sharathmk99 avatar Jul 26 '21 09:07 Sharathmk99

@wpeng102 I did restart the deployment after the ConfigMap changes: k rollout restart deployment -n volcano-system volcano-scheduler

Sharathmk99 avatar Jul 26 '21 10:07 Sharathmk99

@wpeng102 I tried building a Docker image from the master branch, but the issue is still the same: the guaranteed resources for the second queue are not reclaimed. Is it possible to solve the above use case with Volcano?

Logs,

I0806 22:58:42.564073       1 scheduler.go:91] Start scheduling ...
I0806 22:58:42.564206       1 cache.go:840] The priority of job <default/test-job-new> is <high-pri/0>
I0806 22:58:42.564223       1 cache.go:840] The priority of job <default/test-job2> is </0>
I0806 22:58:42.564224       1 cache.go:840] The priority of job <default/test-job1> is </0>
I0806 22:58:42.564254       1 cache.go:840] The priority of job <default/test-job> is </0>
I0806 22:58:42.564305       1 cache.go:878] There are <4> Jobs, <3> Queues and <1> Nodes in total for scheduling.
I0806 22:58:42.564329       1 session.go:165] Open Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f with <4> Job and <3> Queues
I0806 22:58:42.564345       1 proportion.go:73] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564357       1 proportion.go:77] Considering Job <default/test-job2>.
I0806 22:58:42.564362       1 proportion.go:95] Added Queue <test> attributes.
I0806 22:58:42.564366       1 proportion.go:77] Considering Job <default/test-job1>.
I0806 22:58:42.564368       1 proportion.go:77] Considering Job <default/test-job>.
I0806 22:58:42.564371       1 proportion.go:77] Considering Job <default/test-job-new>.
I0806 22:58:42.564374       1 proportion.go:95] Added Queue <test1> attributes.
I0806 22:58:42.564385       1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <10>.
I0806 22:58:42.564391       1 proportion.go:173] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0806 22:58:42.564398       1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0806 22:58:42.564406       1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <10>.
I0806 22:58:42.564414       1 proportion.go:170] queue <test1> is meet
I0806 22:58:42.564418       1 proportion.go:177] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0806 22:58:42.564442       1 proportion.go:189] Remaining resource is  <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564457       1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <5>.
I0806 22:58:42.564467       1 proportion.go:170] queue <test> is meet
I0806 22:58:42.564476       1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0806 22:58:42.564484       1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <5>.
I0806 22:58:42.564490       1 proportion.go:189] Remaining resource is  <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564514       1 proportion.go:142] Exiting when total weight is 0
I0806 22:58:42.564524       1 drf.go:206] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0806 22:58:42.564794       1 binpack.go:158] Enter binpack plugin ...
I0806 22:58:42.564814       1 binpack.go:177] resources [] record in weight but not found on any node
I0806 22:58:42.564820       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0806 22:58:42.564826       1 enqueue.go:44] Enter Enqueue ...
I0806 22:58:42.564830       1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.564834       1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.564838       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0806 22:58:42.564842       1 enqueue.go:103] Leaving Enqueue ...
I0806 22:58:42.564846       1 allocate.go:43] Enter Allocate ...
I0806 22:58:42.564851       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.564859       1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0806 22:58:42.564862       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564866       1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0806 22:58:42.564869       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564872       1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0806 22:58:42.564875       1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564878       1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564882       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564906       1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0806 22:58:42.564916       1 priority.go:70] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564934       1 drf.go:413] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564955       1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0806 22:58:42.564958       1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0806 22:58:42.564987       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0806 22:58:42.564994       1 allocate.go:196] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0806 22:58:42.564998       1 allocate.go:204] There are <1> nodes for Job <default/test-job-new>
I0806 22:58:42.565063       1 scheduler_helper.go:97] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0806 22:58:42.565084       1 scheduler_helper.go:102] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0806 22:58:42.565107       1 statement.go:351] Discarding operations ...
I0806 22:58:42.565115       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565119       1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0806 22:58:42.565121       1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0806 22:58:42.565125       1 gang.go:118] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0806 22:58:42.565131       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0806 22:58:42.565135       1 statement.go:376] Committing operations ...
I0806 22:58:42.565142       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565148       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job>
I0806 22:58:42.565150       1 statement.go:376] Committing operations ...
I0806 22:58:42.565182       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565226       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0806 22:58:42.565232       1 statement.go:376] Committing operations ...
I0806 22:58:42.565248       1 allocate.go:158] Namespace <default> have no queue, skip it
I0806 22:58:42.565272       1 allocate.go:275] Leaving Allocate ...
I0806 22:58:42.565278       1 backfill.go:41] Enter Backfill ...
I0806 22:58:42.565281       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565287       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565294       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565319       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565348       1 backfill.go:90] Leaving Backfill ...
I0806 22:58:42.565352       1 reclaim.go:41] Enter Reclaim ...
I0806 22:58:42.565355       1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0806 22:58:42.565359       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565364       1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.565367       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565372       1 reclaim.go:67] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.565375       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565379       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565393       1 reclaim.go:121] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0806 22:58:42.565429       1 proportion.go:234] Victims from proportion plugins are []
I0806 22:58:42.565434       1 gang.go:97] Can not preempt task <default/test-job1-bash-0> because job test-job1 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565437       1 gang.go:97] Can not preempt task <default/test-job2-bash-0> because job test-job2 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565460       1 gang.go:97] Can not preempt task <default/test-job-bash-0> because job test-job ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565463       1 gang.go:102] Victims from Gang plugins are []
I0806 22:58:42.565470       1 reclaim.go:145] No validated victims on Node <docker-desktop>: no victims
I0806 22:58:42.565479       1 reclaim.go:189] Leaving Reclaim ...
I0806 22:58:42.565582       1 cache.go:730] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0806 22:58:42.565653       1 session.go:187] Close Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f
I0806 22:58:42.565661       1 scheduler.go:110] End scheduling ...

Sharathmk99 avatar Aug 06 '21 23:08 Sharathmk99

@Thor-wl @wpeng102 Can you help me to make fair sharing work? Is my Kubernetes version 1.21 a problem?

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Sharathmk99 avatar Aug 12 '21 09:08 Sharathmk99

I0726 09:19:31.510263 1 gang.go:100] Victims from Gang plugins are []

Hi @Sharathmk99, for Volcano v1.3.0, the most important line I found in the pasted log was:

I0726 09:19:31.510246       1 proportion.go:236] Victims from proportion plugins are []

The related logic is:

for _, reclaimee := range reclaimees {
  ...
  allocated.Sub(reclaimee.Resreq)
  if attr.deserved.LessEqualStrict(allocated) {
    victims = append(victims, reclaimee)
  }
}
klog.V(4).Infof("Victims from proportion plugins are %+v", victims)

For your queue test, I found this log:

Queue <test>: deserved <cpu 10000.00, memory 0.00, hugepages-2Mi 0.00>, allocated <cpu 11000.00, memory 0.00>, share <1.1>

So the situation may be that allocated is first reduced by the reclaimee's request, which leads to deserved > allocated and eventually causes the victim calculation to be skipped. This was a bug in v1.3.0; the related fix is https://github.com/volcano-sh/volcano/pull/1540
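To make the arithmetic concrete, here is a minimal Go sketch of the subtract-then-compare order in the loop quoted above. It is not the plugin's code: it assumes a single CPU dimension in millicores, assumes that allocated carries over between iterations, and uses a hypothetical helper name, selectVictims. The inputs are the deserved and allocated values from the quoted log line and the three job sizes from the issue description.

```go
package main

import "fmt"

// selectVictims is a hypothetical, CPU-only reduction of the quoted loop.
// It subtracts each reclaimee's request from allocated first and only then
// compares against deserved, which is the order described in the comment.
func selectVictims(deserved, allocated int64, requests []int64) []int64 {
	var victims []int64
	for _, req := range requests {
		allocated -= req // subtraction happens before the comparison
		if deserved <= allocated {
			victims = append(victims, req)
		}
	}
	return victims
}

func main() {
	// deserved 10000, allocated 11000, jobs of 5000, 5000 and 1000 millicores.
	fmt.Println(selectVictims(10000, 11000, []int64{5000, 5000, 1000})) // prints []
	// Only a small reclaimee considered first would pass the check.
	fmt.Println(selectVictims(10000, 11000, []int64{1000, 5000, 5000})) // prints [1000]
}
```

With the 5000-millicore jobs considered first, allocated drops to 6000 after the first subtraction, so deserved (10000) is never less than or equal to it and the victim list stays empty, matching the log.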

shinytang6 avatar Aug 12 '21 10:08 shinytang6

@shinytang6 Thank you for the response. I did try building a Docker image from the master branch and testing it, but it still doesn't work.

Ideally, the test queue should deserve 5000, test1 should deserve 5000, and the default queue should deserve 1000. Do I need to share any other logs?

Below is the log from the master-branch Docker image:
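For comparison, the proportion.go lines in the master-branch log in this thread (total cpu 12000; test deserved 6000, then 11000; test1 deserved 500; remaining 5500, then 500; total weight 10, so the empty default queue is not counted) are consistent with a weighted distribution in which each queue's share is capped at its request and the unused part is handed out again. The Go sketch below reproduces those numbers. It is a simplified, CPU-only illustration under that assumption, not the plugin's implementation, and the names queue and distribute are hypothetical.

```go
package main

import "fmt"

// queue holds CPU values in millicores; a simplification of the log's resource vectors.
type queue struct {
	name     string
	weight   int64
	request  int64 // total requested by the queue's jobs
	deserved int64 // share computed by distribute
}

// distribute hands out total by weight in rounds. A queue whose request is
// covered is capped at that request and leaves later rounds. It returns
// whatever is left over.
func distribute(total int64, queues []*queue) int64 {
	remaining := total
	met := map[string]bool{}
	for {
		var totalWeight int64
		for _, q := range queues {
			if !met[q.name] {
				totalWeight += q.weight
			}
		}
		if totalWeight == 0 || remaining == 0 {
			return remaining
		}
		var handedOut int64
		for _, q := range queues {
			if met[q.name] {
				continue
			}
			old := q.deserved
			q.deserved += remaining * q.weight / totalWeight
			if q.request <= q.deserved {
				q.deserved = q.request // cap at what the queue asked for
				met[q.name] = true
			}
			handedOut += q.deserved - old
		}
		remaining -= handedOut
		if handedOut == 0 { // guard against integer-division stalls
			return remaining
		}
	}
}

func main() {
	qs := []*queue{
		{name: "test", weight: 5, request: 11000},
		{name: "test1", weight: 5, request: 500},
	}
	left := distribute(12000, qs)
	fmt.Println(qs[0].deserved, qs[1].deserved, left) // prints 11000 500 500
}
```

Under this reading, the deserved value of test ends at 11000 rather than 5000 because test1 only requests 500, so test is not over its deserved share when reclaim looks for victims.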

I0806 22:58:42.564073       1 scheduler.go:91] Start scheduling ...
I0806 22:58:42.564206       1 cache.go:840] The priority of job <default/test-job-new> is <high-pri/0>
I0806 22:58:42.564223       1 cache.go:840] The priority of job <default/test-job2> is </0>
I0806 22:58:42.564224       1 cache.go:840] The priority of job <default/test-job1> is </0>
I0806 22:58:42.564254       1 cache.go:840] The priority of job <default/test-job> is </0>
I0806 22:58:42.564305       1 cache.go:878] There are <4> Jobs, <3> Queues and <1> Nodes in total for scheduling.
I0806 22:58:42.564329       1 session.go:165] Open Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f with <4> Job and <3> Queues
I0806 22:58:42.564345       1 proportion.go:73] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564357       1 proportion.go:77] Considering Job <default/test-job2>.
I0806 22:58:42.564362       1 proportion.go:95] Added Queue <test> attributes.
I0806 22:58:42.564366       1 proportion.go:77] Considering Job <default/test-job1>.
I0806 22:58:42.564368       1 proportion.go:77] Considering Job <default/test-job>.
I0806 22:58:42.564371       1 proportion.go:77] Considering Job <default/test-job-new>.
I0806 22:58:42.564374       1 proportion.go:95] Added Queue <test1> attributes.
I0806 22:58:42.564385       1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <10>.
I0806 22:58:42.564391       1 proportion.go:173] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0806 22:58:42.564398       1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0806 22:58:42.564406       1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <10>.
I0806 22:58:42.564414       1 proportion.go:170] queue <test1> is meet
I0806 22:58:42.564418       1 proportion.go:177] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0806 22:58:42.564442       1 proportion.go:189] Remaining resource is  <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564457       1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <5>.
I0806 22:58:42.564467       1 proportion.go:170] queue <test> is meet
I0806 22:58:42.564476       1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0806 22:58:42.564484       1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <5>.
I0806 22:58:42.564490       1 proportion.go:189] Remaining resource is  <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564514       1 proportion.go:142] Exiting when total weight is 0
I0806 22:58:42.564524       1 drf.go:206] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0806 22:58:42.564794       1 binpack.go:158] Enter binpack plugin ...
I0806 22:58:42.564814       1 binpack.go:177] resources [] record in weight but not found on any node
I0806 22:58:42.564820       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0806 22:58:42.564826       1 enqueue.go:44] Enter Enqueue ...
I0806 22:58:42.564830       1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.564834       1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.564838       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0806 22:58:42.564842       1 enqueue.go:103] Leaving Enqueue ...
I0806 22:58:42.564846       1 allocate.go:43] Enter Allocate ...
I0806 22:58:42.564851       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.564859       1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0806 22:58:42.564862       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564866       1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0806 22:58:42.564869       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564872       1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0806 22:58:42.564875       1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564878       1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564882       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564906       1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0806 22:58:42.564916       1 priority.go:70] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564934       1 drf.go:413] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564955       1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0806 22:58:42.564958       1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0806 22:58:42.564987       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0806 22:58:42.564994       1 allocate.go:196] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0806 22:58:42.564998       1 allocate.go:204] There are <1> nodes for Job <default/test-job-new>
I0806 22:58:42.565063       1 scheduler_helper.go:97] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0806 22:58:42.565084       1 scheduler_helper.go:102] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0806 22:58:42.565107       1 statement.go:351] Discarding operations ...
I0806 22:58:42.565115       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565119       1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0806 22:58:42.565121       1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0806 22:58:42.565125       1 gang.go:118] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0806 22:58:42.565131       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0806 22:58:42.565135       1 statement.go:376] Committing operations ...
I0806 22:58:42.565142       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565148       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job>
I0806 22:58:42.565150       1 statement.go:376] Committing operations ...
I0806 22:58:42.565182       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565226       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0806 22:58:42.565232       1 statement.go:376] Committing operations ...
I0806 22:58:42.565248       1 allocate.go:158] Namespace <default> have no queue, skip it
I0806 22:58:42.565272       1 allocate.go:275] Leaving Allocate ...
I0806 22:58:42.565278       1 backfill.go:41] Enter Backfill ...
I0806 22:58:42.565281       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565287       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565294       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565319       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565348       1 backfill.go:90] Leaving Backfill ...
I0806 22:58:42.565352       1 reclaim.go:41] Enter Reclaim ...
I0806 22:58:42.565355       1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0806 22:58:42.565359       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565364       1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.565367       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565372       1 reclaim.go:67] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.565375       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565379       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565393       1 reclaim.go:121] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0806 22:58:42.565429       1 proportion.go:234] Victims from proportion plugins are []
I0806 22:58:42.565434       1 gang.go:97] Can not preempt task <default/test-job1-bash-0> because job test-job1 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565437       1 gang.go:97] Can not preempt task <default/test-job2-bash-0> because job test-job2 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565460       1 gang.go:97] Can not preempt task <default/test-job-bash-0> because job test-job ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565463       1 gang.go:102] Victims from Gang plugins are []
I0806 22:58:42.565470       1 reclaim.go:145] No validated victims on Node <docker-desktop>: no victims
I0806 22:58:42.565479       1 reclaim.go:189] Leaving Reclaim ...
I0806 22:58:42.565582       1 cache.go:730] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0806 22:58:42.565653       1 session.go:187] Close Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f
I0806 22:58:42.565661       1 scheduler.go:110] End scheduling ...

Sharathmk99 avatar Aug 12 '21 10:08 Sharathmk99

@shinytang6 Thank you for the response. I did try building a Docker image from the master branch and testing it, but it still doesn't work.

Ideally, the test queue should deserve 5000, test1 should deserve 5000, and the default queue should deserve 1000. Do I need to share any other logs?

Below is the log from the master-branch Docker image:

I0806 22:58:42.564073       1 scheduler.go:91] Start scheduling ...
I0806 22:58:42.564206       1 cache.go:840] The priority of job <default/test-job-new> is <high-pri/0>
I0806 22:58:42.564223       1 cache.go:840] The priority of job <default/test-job2> is </0>
I0806 22:58:42.564224       1 cache.go:840] The priority of job <default/test-job1> is </0>
I0806 22:58:42.564254       1 cache.go:840] The priority of job <default/test-job> is </0>
I0806 22:58:42.564305       1 cache.go:878] There are <4> Jobs, <3> Queues and <1> Nodes in total for scheduling.
I0806 22:58:42.564329       1 session.go:165] Open Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f with <4> Job and <3> Queues
I0806 22:58:42.564345       1 proportion.go:73] The total resource is <cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564357       1 proportion.go:77] Considering Job <default/test-job2>.
I0806 22:58:42.564362       1 proportion.go:95] Added Queue <test> attributes.
I0806 22:58:42.564366       1 proportion.go:77] Considering Job <default/test-job1>.
I0806 22:58:42.564368       1 proportion.go:77] Considering Job <default/test-job>.
I0806 22:58:42.564371       1 proportion.go:77] Considering Job <default/test-job-new>.
I0806 22:58:42.564374       1 proportion.go:95] Added Queue <test1> attributes.
I0806 22:58:42.564385       1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <10>.
I0806 22:58:42.564391       1 proportion.go:173] Format queue <test> deserved resource to <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>
I0806 22:58:42.564398       1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 6000.00, memory 0.00, hugepages-2Mi 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.83>
I0806 22:58:42.564406       1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <10>.
I0806 22:58:42.564414       1 proportion.go:170] queue <test1> is meet
I0806 22:58:42.564418       1 proportion.go:177] The attributes of queue <test1> in proportion: deserved <cpu 500.00, memory 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 500.00, memory 0.00>, share <0.00>
I0806 22:58:42.564442       1 proportion.go:189] Remaining resource is  <cpu 5500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564457       1 proportion.go:153] Considering Queue <test>: weight <5>, total weight <5>.
I0806 22:58:42.564467       1 proportion.go:170] queue <test> is meet
I0806 22:58:42.564476       1 proportion.go:177] The attributes of queue <test> in proportion: deserved <cpu 11000.00, memory 0.00>, allocate <cpu 11000.00, memory 0.00>, request <cpu 11000.00, memory 0.00>, share <1.00>
I0806 22:58:42.564484       1 proportion.go:153] Considering Queue <test1>: weight <5>, total weight <5>.
I0806 22:58:42.564490       1 proportion.go:189] Remaining resource is  <cpu 500.00, memory 26644484096.00, hugepages-2Mi 0.00>
I0806 22:58:42.564514       1 proportion.go:142] Exiting when total weight is 0
I0806 22:58:42.564524       1 drf.go:206] Total Allocatable cpu 12000.00, memory 26644484096.00, hugepages-2Mi 0.00
I0806 22:58:42.564794       1 binpack.go:158] Enter binpack plugin ...
I0806 22:58:42.564814       1 binpack.go:177] resources [] record in weight but not found on any node
I0806 22:58:42.564820       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I0806 22:58:42.564826       1 enqueue.go:44] Enter Enqueue ...
I0806 22:58:42.564830       1 enqueue.go:62] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.564834       1 enqueue.go:62] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.564838       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0806 22:58:42.564842       1 enqueue.go:103] Leaving Enqueue ...
I0806 22:58:42.564846       1 allocate.go:43] Enter Allocate ...
I0806 22:58:42.564851       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.564859       1 allocate.go:90] Added Job <default/test-job-new> into Queue <test1>
I0806 22:58:42.564862       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564866       1 allocate.go:90] Added Job <default/test-job2> into Queue <test>
I0806 22:58:42.564869       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564872       1 allocate.go:90] Added Job <default/test-job1> into Queue <test>
I0806 22:58:42.564875       1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564878       1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564882       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.564906       1 allocate.go:90] Added Job <default/test-job> into Queue <test>
I0806 22:58:42.564916       1 priority.go:70] Priority JobOrderFn: <default/test-job> priority: 0, <default/test-job2> priority: 0
I0806 22:58:42.564934       1 drf.go:413] DRF JobOrderFn: <default/test-job> share state: 0.4166666666666667, <default/test-job2> share state: 0.08333333333333333
I0806 22:58:42.564955       1 allocate.go:94] Try to allocate resource to 1 Namespaces
I0806 22:58:42.564958       1 allocate.go:109] unlockedNode ID: 62db948c-9907-4163-b4cc-a03f9741ea2d, Name: docker-desktop
I0806 22:58:42.564987       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test1>
I0806 22:58:42.564994       1 allocate.go:196] Try to allocate resource to 1 tasks of Job <default/test-job-new>
I0806 22:58:42.564998       1 allocate.go:204] There are <1> nodes for Job <default/test-job-new>
I0806 22:58:42.565063       1 scheduler_helper.go:97] Considering Task <default/test-job-new-bash-new-0> on node <docker-desktop>: <cpu 500.00, memory 0.00> vs. <cpu 150.00, memory 26392825856.00, hugepages-2Mi 0.00>
I0806 22:58:42.565084       1 scheduler_helper.go:102] Predicates failed for task <default/test-job-new-bash-new-0> on node <docker-desktop>: task default/test-job-new-bash-new-0 on node docker-desktop fit failed: node(s) resource fit failed
I0806 22:58:42.565107       1 statement.go:351] Discarding operations ...
I0806 22:58:42.565115       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565119       1 priority.go:70] Priority JobOrderFn: <default/test-job1> priority: 0, <default/test-job> priority: 0
I0806 22:58:42.565121       1 drf.go:413] DRF JobOrderFn: <default/test-job1> share state: 0.4166666666666667, <default/test-job> share state: 0.4166666666666667
I0806 22:58:42.565125       1 gang.go:118] Gang JobOrderFn: <default/test-job1> is ready: true, <default/test-job> is ready: true
I0806 22:58:42.565131       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job2>
I0806 22:58:42.565135       1 statement.go:376] Committing operations ...
I0806 22:58:42.565142       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565148       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job>
I0806 22:58:42.565150       1 statement.go:376] Committing operations ...
I0806 22:58:42.565182       1 allocate.go:162] Try to allocate resource to Jobs in Namespace <default> Queue <test>
I0806 22:58:42.565226       1 allocate.go:196] Try to allocate resource to 0 tasks of Job <default/test-job1>
I0806 22:58:42.565232       1 statement.go:376] Committing operations ...
I0806 22:58:42.565248       1 allocate.go:158] Namespace <default> have no queue, skip it
I0806 22:58:42.565272       1 allocate.go:275] Leaving Allocate ...
I0806 22:58:42.565278       1 backfill.go:41] Enter Backfill ...
I0806 22:58:42.565281       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565287       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565294       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565319       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565348       1 backfill.go:90] Leaving Backfill ...
I0806 22:58:42.565352       1 reclaim.go:41] Enter Reclaim ...
I0806 22:58:42.565355       1 reclaim.go:50] There are <4> Jobs and <3> Queues in total for scheduling.
I0806 22:58:42.565359       1 job_info.go:561] job test-job-new/default actual: map[bash-new:1], ji.TaskMinAvailable: map[bash-new:1]
I0806 22:58:42.565364       1 reclaim.go:67] Added Queue <test1> for Job <default/test-job-new>
I0806 22:58:42.565367       1 job_info.go:561] job test-job2/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565372       1 reclaim.go:67] Added Queue <test> for Job <default/test-job2>
I0806 22:58:42.565375       1 job_info.go:561] job test-job1/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565379       1 job_info.go:561] job test-job/default actual: map[bash:1], ji.TaskMinAvailable: map[bash:1]
I0806 22:58:42.565393       1 reclaim.go:121] Considering Task <default/test-job-new-bash-new-0> on Node <docker-desktop>.
I0806 22:58:42.565429       1 proportion.go:234] Victims from proportion plugins are []
I0806 22:58:42.565434       1 gang.go:97] Can not preempt task <default/test-job1-bash-0> because job test-job1 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565437       1 gang.go:97] Can not preempt task <default/test-job2-bash-0> because job test-job2 ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565460       1 gang.go:97] Can not preempt task <default/test-job-bash-0> because job test-job ready num(1) <= MinAvailable(1) for gang-scheduling
I0806 22:58:42.565463       1 gang.go:102] Victims from Gang plugins are []
I0806 22:58:42.565470       1 reclaim.go:145] No validated victims on Node <docker-desktop>: no victims
I0806 22:58:42.565479       1 reclaim.go:189] Leaving Reclaim ...
I0806 22:58:42.565582       1 cache.go:730] task unscheduleable default/test-job-new-bash-new-0, message: all nodes are unavailable: 1 node(s) resource fit failed., skip by no condition update
I0806 22:58:42.565653       1 session.go:187] Close Session d1cb963c-8e81-4e8c-b5e5-e920535ee55f
I0806 22:58:42.565661       1 scheduler.go:110] End scheduling ...
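
A few of the numbers in this log can be reproduced by hand, which helps when reading it. The sketch below is illustrative only: the helper names (`dominant_share`, `fits`, `gang_preemptable`) are hypothetical and are not Volcano functions, and the gang rule is inferred from the wording of the `gang.go:97` message rather than from the source. The per-job CPU figures (5000m and 1000m) are assumed from the job sizes given in the issue description.

```python
# Illustrative helpers only: they mirror values printed in the log above,
# they are NOT taken from Volcano's source code.

def dominant_share(allocated: dict, total: dict) -> float:
    """Largest allocated/total ratio over resources with nonzero capacity."""
    return max(
        (allocated.get(name, 0.0) / cap for name, cap in total.items() if cap > 0),
        default=0.0,
    )


def fits(request: dict, idle: dict) -> bool:
    """True only if every requested amount is available as idle on the node."""
    return all(amount <= idle.get(name, 0.0) for name, amount in request.items())


def gang_preemptable(ready: int, min_available: int) -> bool:
    """Inferred from the log text: a task is protected when ready <= MinAvailable."""
    return ready > min_available


# Values copied from the log lines (drf.go:206 and scheduler_helper.go:97).
total = {"cpu": 12000.0, "memory": 26644484096.0}
idle = {"cpu": 150.0, "memory": 26392825856.0}
request = {"cpu": 500.0, "memory": 0.0}

# Assumed job sizes: two 5-CPU jobs and one 1-CPU job in queue <test>.
share_5cpu = dominant_share({"cpu": 5000.0}, total)
share_1cpu = dominant_share({"cpu": 1000.0}, total)

print("share of a 5-CPU job:", share_5cpu)
print("share of a 1-CPU job:", share_1cpu)
print("new task fits on node:", fits(request, idle))
print("running task preemptable:", gang_preemptable(ready=1, min_available=1))
```

Read this way, the new task fails the predicate because it asks for 500m CPU while only 150m is idle, and every running job in queue `test` has one ready task with `MinAvailable` 1, so the gang check rejects each of them as a victim. The proportion plugin also reports an empty victim list, so this sketch explains only what the log prints, not why proportion proposes no victims.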

I will take a look at that. My intuition is that there are still some potential bugs in the proportion plugin.

shinytang6 avatar Aug 12 '21 11:08 shinytang6

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Nov 10 '21 20:11 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Apr 24 '22 03:04 stale[bot]
