vGPU cannot perform high-priority preemption scheduling
What happened:
I am using the latest version of Volcano vGPU and expect high-priority tasks to be able to preempt low-priority tasks.
Node capacity information:

status:
  capacity:
    volcano.sh/vgpu-number: '2'
volcano-scheduler.conf (configmap):

actions: "reclaim, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
- plugins:
  - name: gang
    enableJobOrder: false
    enablePreemptable: false
    enableJobStarving: false
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable GPU sharing
  - name: proportion
  - name: nodeorder
  - name: binpack
priorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "high priority"
2 low-priority tasks:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-low1
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: testjob
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
      spec:
        containers:
        - command:
          - sleep
          - 8m
          name: cuda-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          resources:
            limits:
              volcano.sh/vgpu-number: 1
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-low2
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: testjob
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
      spec:
        containers:
        - command:
          - sleep
          - 10m
          name: cuda-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          resources:
            limits:
              volcano.sh/vgpu-number: 1
1 high-priority task:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-high
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: testjob
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - sleep
          - 2m
          name: cuda-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          resources:
            limits:
              volcano.sh/vgpu-number: 1
What you expected to happen:
The two low-priority jobs already occupy the entire node's vGPU resources (2 vGPUs). When a high-priority job requesting 1 vGPU is created, it should evict one of the low-priority jobs so that the high-priority job can run, and the evicted low-priority job should go into the Pending state. For example:
---begin---
NAME       STATUS
job-low1   Running
job-low2   Running
---wait---
NAME       STATUS
job-high   Running
job-low1   Running
job-low2   Pending
---wait---
NAME       STATUS
job-high   Completed
job-low1   Running
job-low2   Running
---end---
NAME       STATUS
job-high   Completed
job-low1   Completed
job-low2   Completed
Is this a configuration error on my part, or a bug?
How to reproduce it (as minimally and precisely as possible):
- If CPU or memory resources are requested, priority preemption is triggered correctly (a CPU-only counterexample sketch follows the scheduler log below).
- With either gpushare or vgpu resources, priority preemption never happens. The most obvious observation from the scheduler log is:
I1109 12:18:49.214378 1 preempt.go:43] Enter Preempt ...
I1109 12:18:49.214390 1 job_info.go:728] job job-high-14881f23-c9a4-44b9-a3cf-46e130a51b99/default actual: map[], ji.TaskMinAvailable: map[nginx:1]
I1109 12:18:49.214407 1 preempt.go:58] Job <default/job-high-14881f23-c9a4-44b9-a3cf-46e130a51b99> Queue skip preemption, reason: NotEnoughPodsOfTask, message Not enough valid pods of each task for gang-scheduling
I1109 12:18:49.214463 1 job_info.go:728] job job-low2-1d2e78fa-028d-475e-9ffc-5598d837d80b/default actual: map[nginx:1], ji.TaskMinAvailable: map[testjob:1]
I1109 12:18:49.214488 1 job_info.go:728] job job-low1-b024ff24-37f1-489b-8956-93e78c46a70c/default actual: map[nginx:1], ji.TaskMinAvailable: map[testjob:1]
I1109 12:18:49.214509 1 preempt.go:194] No Preemptors in Queue , break
I1109 12:18:49.214522 1 statement.go:378] Committing operations
I1109 12:18:49.214536 1 preempt.go:194] Leaving Preempt ...
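For comparison, here is a minimal sketch of a CPU-only high-priority job that does trigger preemption in this setup. The name and resource amounts are assumed purely for illustration; the matching low-priority jobs would request CPU instead of volcano.sh/vgpu-number, large enough to fill the node.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-high-cpu            # hypothetical name for illustration
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  tasks:
  - replicas: 1
    name: testjob
    template:
      spec:
        containers:
        - name: cpu-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          command:
          - sleep
          - 2m
          resources:
            requests:
              cpu: "2"          # assumed value, chosen so the node cannot fit it without evicting a low-priority pod
            limits:
              cpu: "2"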
Anything else we need to know?:
Similar problems: https://github.com/volcano-sh/volcano/issues/2547 https://github.com/volcano-sh/volcano/pull/2916 ...
Environment:
- Volcano Version: 1.8.1
- Kubernetes version (use kubectl version): 1.19
- OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (image)
- Kernel (e.g. uname -a): 5.4.0-150-generic
Hi, please try to modify volcano-scheduler.conf's actions field to "allocate, preempt, backfill" to see the result.
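For clarity, the suggested configmap edit would look like the sketch below; only the actions line differs from the configuration posted above, with reclaim removed and preempt ordered before backfill.

actions: "allocate, preempt, backfill"
tiers:
# ... same plugin tiers as in the original volcano-scheduler.conf above ...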
@Monokaix Thank you for your reply.
I tried your suggestion, but it still doesn't work: the high-priority job stays in the Pending state.
The log is printed as follows:
E1110 07:30:56.874510 1 device_info.go:187] deviceSharing err= not enough gpu fitted on this node
I1110 07:30:56.874524 1 predicate_helper.go:75] Predicates failed for task <default/job-testjob-nginx-0> on node : task default/job-high-testjob-0 on node node-gpu fit failed: not enough gpu fitted on this node
I1110 07:30:56.874588 1 preempt.go:108] No preemptor task in job <default/job-high-cb92f563-74f3-4912-bacd-fe230e57915a>.
I1110 07:30:56.874605 1 statement.go:352] Discarding operations ...
I1110 07:30:56.874629 1 predicates.go:384] pod(default/job--high-testjob-0) affinity require information is nil, plugin InterPodAffinity is skipped
I1110 07:30:56.874676 1 statement.go:378] Committing operations ...
I1110 07:30:56.874683 1 statement.go:378] Committing operations ...
I1110 07:30:56.885269 1 cache.go:262] Updating pod condition for default/job-high-testjob-0 to (PodScheduled==False)
I1110 07:30:56.930763 1 session.go:240] Close Session
This looks like the same root cause mentioned in https://github.com/volcano-sh/volcano/pull/2916, but as far as I can tell that fix was already merged in version 1.8.0.
Because vGPU sharing is enabled, the GPU resource check in the FilterNode() function in device_info.go fails and returns a "not enough gpu fitted on this node" error. Could that failure be preventing the candidate nodes from being computed correctly in the predicateNodes phase, and therefore keeping the preemption logic from ever running?
Could you please provide some solutions or ideas? @wangyang0616 @william-wang
@archlitchi Have you encountered the same issue in your env?
Can preemption happen with CPU/memory after modifying the scheduler config?
Exactly. For CPU/memory, preemption works normally, but for vgpu/gpu-share resources, preemption does not happen and the high-priority job stays in the Pending state.
What's the node's allocatable status?
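For reference, the allocatable values can be read from the node object, for example with kubectl get node <node-name> -o yaml. A sketch of the fields to check, with illustrative values assumed (the vgpu number would normally match the capacity shown at the top of this issue):

status:
  allocatable:
    cpu: '8'                        # assumed value for illustration
    memory: 32Gi                    # assumed value for illustration
    volcano.sh/vgpu-number: '2'     # expected to match capacity unless something is reserved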
https://github.com/volcano-sh/volcano/pull/3450 and https://github.com/volcano-sh/volcano/pull/3458 solve this; you can try it with the latest version. :)
/close
@Monokaix: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.