vGPU cannot perform high-priority preemption scheduling
What happened:
I am using the latest version of Volcano vGPU and expect high-priority tasks to be able to preempt low-priority tasks.
Node capacity information:

status:
  capacity:
    volcano.sh/vgpu-number: '2'
volcano-scheduler.conf (configmap):

actions: "reclaim, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
- plugins:
  - name: gang
    enableJobOrder: false
    enablePreemptable: false
    enableJobStarving: false
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable GPU sharing
  - name: proportion
  - name: nodeorder
  - name: binpack
priorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "high priority"
2 low-priority tasks:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-low1
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: testjob
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
      spec:
        containers:
        - command:
          - sleep
          - 8m
          name: cuda-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          resources:
            limits:
              volcano.sh/vgpu-number: 1
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-low2
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: testjob
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      metadata:
        annotations:
          volcano.sh/preemptable: "true"
      spec:
        containers:
        - command:
          - sleep
          - 10m
          name: cuda-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          resources:
            limits:
              volcano.sh/vgpu-number: 1
1 high-priority task:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-high
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: testjob
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - sleep
          - 2m
          name: cuda-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          resources:
            limits:
              volcano.sh/vgpu-number: 1
What you expected to happen:
The two low-priority jobs already occupy the entire node's vGPU resources (2 vGPUs). When a high-priority job requesting 1 vGPU is created, it should evict one of the low-priority jobs so that the high-priority job can run, and the evicted low-priority job should go into the Pending state. For example:
---begin---
NAME       STATUS
job-low1   Running
job-low2   Running
---wait---
NAME       STATUS
job-high   Running
job-low1   Running
job-low2   Pending
---wait---
NAME       STATUS
job-high   Completed
job-low1   Running
job-low2   Running
---end---
NAME       STATUS
job-high   Completed
job-low1   Completed
job-low2   Completed
Is this a configuration error on my part, or a bug?
How to reproduce it (as minimally and precisely as possible):
- If CPU or memory resources are requested, priority preemption is triggered correctly (a CPU-only counterexample sketch follows the scheduler log below).
- With either gpushare or vgpu resources, priority preemption never happens. The most obvious observation from the scheduler log is:
I1109 12:18:49.214378 1 preempt.go:43] Enter Preempt ...
I1109 12:18:49.214390 1 job_info.go:728] job job-high-14881f23-c9a4-44b9-a3cf-46e130a51b99/default actual: map[], ji.TaskMinAvailable: map[nginx:1]
I1109 12:18:49.214407 1 preempt.go:58] Job <default/job-high-14881f23-c9a4-44b9-a3cf-46e130a51b99> Queue skip preemption, reason: NotEnoughPodsOfTask, message Not enough valid pods of each task for gang-scheduling
I1109 12:18:49.214463 1 job_info.go:728] job job-low2-1d2e78fa-028d-475e-9ffc-5598d837d80b/default actual: map[nginx:1], ji.TaskMinAvailable: map[testjob:1]
I1109 12:18:49.214488 1 job_info.go:728] job job-low1-b024ff24-37f1-489b-8956-93e78c46a70c/default actual: map[nginx:1], ji.TaskMinAvailable: map[testjob:1]
I1109 12:18:49.214509 1 preempt.go:194] No Preemptors in Queue , break
I1109 12:18:49.214522 1 statement.go:378] Committing operations
I1109 12:18:49.214536 1 preempt.go:194] Leaving Preempt ...
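For comparison, here is a minimal sketch of a CPU-only high-priority job that does trigger preemption in this setup. The name and resource amounts are assumed purely for illustration; the matching low-priority jobs would request CPU instead of volcano.sh/vgpu-number, large enough to fill the node.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-high-cpu            # hypothetical name for illustration
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  tasks:
  - replicas: 1
    name: testjob
    template:
      spec:
        containers:
        - name: cpu-container
          image: nvidia/cuda:10.1-base-ubuntu18.04
          command:
          - sleep
          - 2m
          resources:
            requests:
              cpu: "2"          # assumed value, chosen so the node cannot fit it without evicting a low-priority pod
            limits:
              cpu: "2"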
Anything else we need to know?:
Similar problems: https://github.com/volcano-sh/volcano/issues/2547 https://github.com/volcano-sh/volcano/pull/2916 ...
Environment:
- Volcano Version: 1.8.1
- Kubernetes version (use kubectl version): 1.19
- OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (image)
- Kernel (e.g. uname -a): 5.4.0-150-generic
Hi, please try to modify volcano-scheduler.conf's actions field to "allocate, preempt, backfill" to see the result.
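For clarity, the suggested configmap edit would look like the sketch below; only the actions line differs from the configuration posted above, with reclaim removed and preempt ordered before backfill.

actions: "allocate, preempt, backfill"
tiers:
# ... same plugin tiers as in the original volcano-scheduler.conf above ...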
@Monokaix Thank you for your reply.
I tried your suggestion, but it still doesn't work: the high-priority job stays in the Pending state.
The log is printed as follows:
E1110 07:30:56.874510 1 device_info.go:187] deviceSharing err= not enough gpu fitted on this node
I1110 07:30:56.874524 1 predicate_helper.go:75] Predicates failed for task <default/job-testjob-nginx-0> on node : task default/job-high-testjob-0 on node node-gpu fit failed: not enough gpu fitted on this node
I1110 07:30:56.874588 1 preempt.go:108] No preemptor task in job <default/job-high-cb92f563-74f3-4912-bacd-fe230e57915a>.
I1110 07:30:56.874605 1 statement.go:352] Discarding operations ...
I1110 07:30:56.874629 1 predicates.go:384] pod(default/job--high-testjob-0) affinity require information is nil, plugin InterPodAffinity is skipped
I1110 07:30:56.874676 1 statement.go:378] Committing operations ...
I1110 07:30:56.874683 1 statement.go:378] Committing operations ...
I1110 07:30:56.885269 1 cache.go:262] Updating pod condition for default/job-high-testjob-0 to (PodScheduled==False)
I1110 07:30:56.930763 1 session.go:240] Close Session
This looks like the same root cause mentioned in https://github.com/volcano-sh/volcano/pull/2916, but as far as I can tell that fix was already merged in version 1.8.0.
Because vGPU sharing is enabled, the GPU resource check in the FilterNode() function in device_info.go fails and returns a "not enough gpu fitted on this node" error. Could that failure be preventing the candidate nodes from being computed correctly in the predicateNodes phase, and therefore keeping the preemption logic from ever running?
Could you please provide some solutions or ideas? @wangyang0616 @william-wang
@archlitchi Have you encountered the same issue in your env?
Can preemption happen with CPU/memory after modifying the scheduler config?
Exactly. For CPU/memory, preemption works normally, but for vgpu/gpu-share resources, preemption does not happen and the high-priority job stays in the Pending state.
What's the node's allocatable status?
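For reference, the allocatable values can be read from the node object, for example with kubectl get node <node-name> -o yaml. A sketch of the fields to check, with illustrative values assumed (the vgpu number would normally match the capacity shown at the top of this issue):

status:
  allocatable:
    cpu: '8'                        # assumed value for illustration
    memory: 32Gi                    # assumed value for illustration
    volcano.sh/vgpu-number: '2'     # expected to match capacity unless something is reserved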
https://github.com/volcano-sh/volcano/pull/3450 and https://github.com/volcano-sh/volcano/pull/3458 solve this; you can try it with the latest version. :)
/close
@Monokaix: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.