
Multiple small, lower-priority jobs submitted later always run before a larger, higher-priority job submitted earlier

Open tghfly opened this issue 2 years ago • 22 comments

What happened: At a certain point the cluster's available GPU memory (one big job is already running) cannot satisfy one more big job, but can satisfy several small jobs. In this situation, multiple small, lower-priority jobs submitted later always start running before the big, higher-priority job submitted earlier.

What you expected to happen: The big, higher-priority job submitted first should reach the Running state before the small, lower-priority jobs submitted later; the later, lower-priority small jobs should not jump the queue.

How to reproduce it (as minimally and precisely as possible): Note: priority p1 (1100000) < p2 (1200000)

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: gpu
spec:
  weight: 1
  reclaimable: true
  capability:
    volcano.sh/gpu-memory: 6144
---
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, reclaim, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
        arguments:
          predicate.GPUSharingEnable: true
      - name: proportion
      - name: nodeorder
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n    enablePreemptable: false\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n    enablePreemptable: false\n  - name: predicates\n    arguments:\n      predicate.GPUSharingEnable: true\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"kubeflow"}}
  creationTimestamp: "2023-08-25T00:38:53Z"
  name: volcano-scheduler-configmap
  namespace: kubeflow
  resourceVersion: "5072688"
  uid: 72433a1a-d406-42d4-957d-bd33fe129087
---
# job1 executes first
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-1
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 4096
          restartPolicy: OnFailure

# job2 runs 5s after job1
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-2
spec:
  #minAvailable: 1
  priorityClassName: p2
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 60"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 5000
          restartPolicy: OnFailure

# job3 runs 5s after job2
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-3
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 1500
          restartPolicy: OnFailure

# job4 runs 5s after job3
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-4
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 2048
          restartPolicy: OnFailure

# job5 runs 5s after job4
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-5
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 2048
          restartPolicy: OnFailure
Apply sc2-1 first; its pod enters the Running state immediately.
Wait 5s, then apply sc2-2; its pod stays Pending and its PodGroup is in the Inqueue state.
Wait 5s, then apply sc2-3; its pod enters the Running state immediately.
Wait 5s, then apply sc2-4; its pod stays Pending and its PodGroup is in the Inqueue state.
Wait 5s, then apply sc2-5; its pod stays Pending and its PodGroup is in the Inqueue state.
After 90s, sc2-1 completes; sc2-4 and sc2-5 immediately go from Pending to Running, while sc2-2 remains Pending.
sc2-2, which was submitted earlier and has the higher priority, ends up running last.

NAME                                         STATUS    MINMEMBER   RUNNINGS   AGE
sc2-1-0c80784c-b1f3-44b5-aabf-8a5141e01466   Running   1           1          42s
sc2-2-c36af1c3-a7ed-4f69-80e9-915f4f91fb92   Inqueue   1                      37s
sc2-3-5f9170a2-3e88-4984-b6f2-4400f197c374   Running   1           1          31s
sc2-4-6e3f1f36-fb54-4d53-972a-85b4e0a45e63   Inqueue   1                      26s
sc2-5-b18cdce8-d796-4b14-ac03-c826b00ae884   Inqueue   1                      20s

I tried adjusting the actions and plugins in `volcano-scheduler-configmap`, for example using only SLA and predicates, but the execution order did not change.

I also tried the configuration from this issue https://github.com/volcano-sh/volcano/issues/2052, but it did not solve my problem.

Anything else we need to know?:

Environment:

volcano version: 1.8.0
node info
  pods:                    110
  volcano.sh/gpu-memory:   6144
  volcano.sh/gpu-number:   1
  volcano.sh/vgpu-number:  0
System Info:
  Kernel Version:             5.15.0-79-generic
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.2
  Kubelet Version:            v1.27.4
  Kube-Proxy Version:         v1.27.4

tghfly avatar Sep 03 '23 07:09 tghfly

Please try resource reservation for target jobs. https://github.com/volcano-sh/volcano/blob/master/docs/design/job-resource-reservation-design.md

lowang-bh avatar Sep 04 '23 03:09 lowang-bh

The reserve action has been deprecated since v1.2 and replaced by the SLA plugin, but the SLA plugin did not take effect in my tests.

tghfly avatar Sep 04 '23 03:09 tghfly

Thanks for your report. We have added this issue to the pipeline and will investigate.

william-wang avatar Sep 11 '23 06:09 william-wang

/assign @Mufengzhe

william-wang avatar Sep 11 '23 06:09 william-wang

@tghfly You did not configure the sla plugin in your scheduler configmap. Please follow this doc to configure sla and try again: https://github.com/volcano-sh/volcano/blob/master/docs/design/sla-plugin.md

william-wang avatar Sep 11 '23 08:09 william-wang

I have tried the SLA plugin, but it did not meet expectations: low-priority small jobs still run before the high-priority big job.

tghfly avatar Sep 13 '23 05:09 tghfly

@Mufengzhe please have a look :)

william-wang avatar Sep 13 '23 06:09 william-wang

Refer to the SLA design doc and add an annotation to sc2-1, for example: sla-waiting-time: 1s

Make sure the configmap contains at least the following plugins:

- name: proportion
- name: sla

Try it again and see if it meets expectations.

wangyang0616 avatar Sep 16 '23 08:09 wangyang0616

When the SLA plugin is configured globally, every job is checked by the plugin. Once a job's wait time (sla-waiting-time) expires, it enters the Inqueue state. All jobs in Inqueue then wait for resources at the same time, and whichever job's request fits the available resources is scheduled first; this is why the small jobs submitted later execute first, while the big job stays in Inqueue until its resource request can be satisfied. The workaround for now is to remove the global SLA plugin and apply it only on the big jobs that need it. I still consider this a bug and look forward to a proper fix.
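The behavior described above can be sketched as a minimal simulation (illustrative Go only, not Volcano's actual allocation code; `job` and `allocate` are made-up names): jobs are visited in priority order, but without any reservation only requests that fit the idle pool get bound.

```go
package main

import (
	"fmt"
	"sort"
)

// job models a PodGroup waiting in the Inqueue state.
type job struct {
	name     string
	priority int
	gpuMem   int // requested volcano.sh/gpu-memory (MiB)
}

// allocate mimics the behavior described above: Inqueue jobs are
// considered in priority order, but without reservation a job is only
// bound if its request fits the currently idle resources -- so a
// smaller, lower-priority job that fits gets scheduled while the big
// high-priority job keeps waiting.
func allocate(idle int, inqueue []job) []string {
	sort.SliceStable(inqueue, func(i, j int) bool {
		return inqueue[i].priority > inqueue[j].priority
	})
	scheduled := []string{}
	for _, j := range inqueue {
		if j.gpuMem <= idle {
			idle -= j.gpuMem
			scheduled = append(scheduled, j.name)
		}
	}
	return scheduled
}

func main() {
	// Node capacity 6144; sc2-1 (4096) already running, so 2048 idle.
	inqueue := []job{
		{"sc2-2", 1200000, 5000}, // big, higher priority
		{"sc2-3", 1100000, 1500}, // small, lower priority
	}
	fmt.Println(allocate(2048, inqueue)) // prints [sc2-3]
}
```

With 2048 MiB idle, sc2-2 (5000) never fits, so the lower-priority sc2-3 (1500) is scheduled first, matching the reproduction above.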

Mufengzhe avatar Sep 18 '23 02:09 Mufengzhe

@Mufengzhe That is not quite what I observe: whether SLA is configured globally or only on a single job, and regardless of whether the job is big or small, every job enters the Inqueue state after submission. @wangyang0616, with the configuration below it still does not meet expectations: the big, higher-priority job submitted first stays Pending, while the small, lower-priority jobs submitted later jump the queue and run.

data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: proportion
      - name: sla
        arguments:
          sla-waiting-time: 48h
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
      - name: predicates
        arguments:
          predicate.GPUSharingEnable: true

tghfly avatar Sep 19 '23 12:09 tghfly

I removed the SLA plugin from the configmap and added the SLA annotation only on the large job that needs it, which achieved the desired effect. My configuration is as follows: configmap

apiVersion: v1
data:
  volcano-scheduler-ci.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
      - name: proportion
      - name: task-topology

job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-big
  annotations:
    sla-waiting-time: 5s

You can try it again and see if it meets expectations.

Mufengzhe avatar Sep 21 '23 03:09 Mufengzhe

@Mufengzhe Thank you. I already tried that, but it didn't work. Could it be related to the predicates plugin?

data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 1.0
      - name: predicates
        arguments:
          predicate.GPUSharingEnable: true
      - name: proportion
      - name: task-topology
# sc2-1.yaml
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-1
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 4096
          restartPolicy: OnFailure
# sc2-2.yaml
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-2
  annotations:
    sla-waiting-time: 5s
spec:
  minAvailable: 1
  priorityClassName: p2
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 60"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 5000
          restartPolicy: OnFailure
# sc2-3.yaml
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-3
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 1500
          restartPolicy: OnFailure
# sc2-4.yaml
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-4
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 2048
          restartPolicy: OnFailure
# sc2-5.yaml
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sc2-5
spec:
  minAvailable: 1
  priorityClassName: p1
  schedulerName: volcano
  queue: gpu
  tasks:
    - replicas: 1
      name: "test"
      template:
        metadata:
          name: web
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s02
          containers:
            - image: registry.demo.com/cube-studio/gpu-player:v2  
              command: ["sh", "-c", "sleep 90"]
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                limits:
                  volcano.sh/gpu-memory: 2048
          restartPolicy: OnFailure
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: gpu
spec:
  weight: 1
  reclaimable: true
  capability:
    volcano.sh/gpu-memory: 6144
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: p1
preemptionPolicy: PreemptLowerPriority
value: 1100000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: p2
preemptionPolicy: PreemptLowerPriority
value: 1200000
cat > start.sh <<EOF
kubectl apply -f sc2-1.yaml
sleep 5 
kubectl apply -f sc2-2.yaml
sleep 5 
kubectl apply -f sc2-3.yaml
sleep 5 
kubectl apply -f sc2-4.yaml
sleep 5
kubectl apply -f sc2-5.yaml
EOF

sh start.sh

result: sc2-1 runs first; sc2-2 stays Pending; sc2-3 skips ahead of sc2-2 and runs; after sc2-1 ends, sc2-4 and sc2-5 skip ahead of sc2-2 to run; finally sc2-2 runs.

tghfly avatar Sep 21 '23 07:09 tghfly


Is anyone following up on this issue?

tghfly avatar Oct 30 '23 02:10 tghfly

I met the same issue. We want to run a big task but there are not enough resources, so I added a Volcano plugin to fix it. You should add a job enqueue function (ssn.AddJobEnqueueableFn) in your Volcano plugin.

ssn.AddJobEnqueueableFn(h.Name(), func(obj interface{}) int {
   ...
   // return util.Reject when the requested resources exceed the cluster's idle resources.
   // return util.Permit when the requested resources fit within the cluster's idle resources.
})

And you should record the reserved resources. The small-job resource check below goes inside ssn.AddJobEnqueueableFn:

if request.resources + reservedResources >= clusterIdleResources:
    return util.Reject

So small jobs stay pending until the big task is scheduled. In short, you need to take the reserved resources into account when judging resources. For example, suppose the cluster has 4 idle NVIDIA cards and the big task requests 5 cards:

if 5 > 4:
   return util.Reject

When your plugin handles a small job (the small job needs 1 card):

if smallJobRequest + 5 > 4:   // 5 is the reserved resources
   return util.Reject

One important point is to record the reserved resources (the 5 cards), and you should also take priority into account. So:

if reservedResources.priority > job.priority:
   // you should add the reserved resources.

I created a CRD (named queueinfo).

queueinfo_types.go

type QueueInfoSpec struct {
	QueueId        string                    `json:"queueId,omitempty"`
	JobReserveInfo map[string]JobReserveInfo `json:"jobReserveInfo"`
}

type JobReserveInfo struct {
	Priority int32               `json:"priority"`
	Resource corev1.ResourceList `json:"resource"`
}
  1. Create the CRD to save reservedResources and priority.
  2. If a small job's priority is lower than the reservation's priority, add the reserved resources to its fit check. The small job then stays Pending until the big job moves from Pending to Inqueue, at which point you remove the big job's reservation and update the CRD's reserved-resource field.
  3. You can also take the SLA waiting time into account: when the waiting time is exceeded and resources are insufficient, add the big job's resources to the queueinfo CRD's jobReserveInfo field (keyed by job.Uid, with a struct{priority, resources} value) so that small jobs stop being scheduled ahead of it.
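The steps above can be sketched roughly as follows (illustrative Go only; `reservation`, `enqueueable`, and the permit/reject constants are stand-ins for the queueinfo CRD types and `util.Permit`/`util.Reject`, not actual Volcano APIs):

```go
package main

import "fmt"

// reservation is a stand-in for one entry of the queueinfo CRD's
// jobReserveInfo map described above (hypothetical type, not a real
// Volcano API).
type reservation struct {
	priority int32
	gpuMem   int64
}

// permit and reject mirror util.Permit and util.Reject.
const (
	permit = 1
	reject = -1
)

// enqueueable decides whether a job may enter Inqueue: resources
// reserved for higher-priority pending jobs are subtracted from the
// idle pool before the fit check, so lower-priority small jobs stay
// Pending while the big job's reservation is in place.
func enqueueable(jobPriority int32, jobGPUMem, idleGPUMem int64, reserved []reservation) int {
	for _, r := range reserved {
		if r.priority > jobPriority {
			idleGPUMem -= r.gpuMem
		}
	}
	if jobGPUMem > idleGPUMem {
		return reject
	}
	return permit
}

func main() {
	// 2048 idle; 5000 reserved for the higher-priority big job (sc2-2).
	reserved := []reservation{{priority: 1200000, gpuMem: 5000}}
	fmt.Println(enqueueable(1100000, 1500, 2048, reserved)) // prints -1 (reject)
}
```

In a real plugin this check would live inside the ssn.AddJobEnqueueableFn callback, with the reservation list read from and written back to the queueinfo CRD as described in steps 1-3.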

conglanjun avatar Nov 30 '23 03:11 conglanjun

Can you share this plugin ? @conglanjun

tghfly avatar Dec 04 '23 01:12 tghfly

Can't it preempt? Won't the high-priority job kill the low-priority tasks?

zhaizhch avatar Feb 21 '24 10:02 zhaizhch

I ran into a similar problem. Did you solve it in the end?

Geyke avatar Jul 19 '24 02:07 Geyke

Is there any progress?

cocodee avatar Aug 22 '24 01:08 cocodee

I have an idea to solve it, but it will waste compute capacity if we keep the resources free and don't let the low-priority, small-request jobs run.

Another solution: you guys can disable preemptable in gang plugin.

lowang-bh avatar Aug 22 '24 02:08 lowang-bh

I have an idea to solve it, but it will waste compute capacity if we keep the resources free and don't let the low-priority, small-request jobs run.

Another solution: you guys can disable preemptable in gang plugin. We need the jobs to be scheduled in strict order. What is the ideal way to solve it?

cocodee avatar Aug 22 '24 05:08 cocodee

We need the jobs to be scheduled in strict order. What is the ideal way to solve it?

We can keep the pipelined resources instead of discarding them, so that low-priority tasks cannot use them. The high-priority job will then be scheduled in the next session if it satisfies gang; otherwise it will hold those resources until it is scheduled successfully.

I will raise a PR when I have some free time.

lowang-bh avatar Aug 22 '24 06:08 lowang-bh