Volcano scheduler doesn’t reclaim/preempt jobs as a gang.
What is the problem you're trying to solve
When reclaiming resources, Volcano checks every node in the cluster. For each node, it looks for pods that it can evict ("victims"). It keeps doing this node by node until it frees up enough resources to run the new task.
This method can cause problems for clusters running distributed AI training jobs.
For example: Suppose your cluster is running 10 distributed training jobs (Job 1 through Job 10), and each job has 10 pods. With the current way Volcano works, it might evict only 1 pod from each job to reclaim resources. This means all 10 jobs fail, since each job needs all its pods (“gang scheduling”).
But we’d prefer Volcano to evict all 10 pods from just one job (say, Job 1). That way, only Job 1 fails, and the other 9 jobs keep training.
TLDR: We want Volcano to fully evict a single job rather than partially evicting many jobs, which would cause more overall failures.
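To make the difference concrete, here is a minimal Go sketch of the 10-jobs-by-10-pods example above. It is not Volcano code: `Pod`, `evictSpread`, `evictWholeJobs`, and `buildJobs` are hypothetical names, every pod is assumed to hold one GPU, and `evictSpread` models the worst case described above (one pod taken from each job in turn) rather than Volcano's exact node-by-node order.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a running task.
// It is not a Volcano type.
type Pod struct {
	Job string
	GPU int
}

// affectedJobs counts the distinct jobs that lose at least one pod.
// With gang semantics, every such job fails as a whole.
func affectedJobs(victims []Pod) int {
	seen := map[string]bool{}
	for _, v := range victims {
		seen[v.Job] = true
	}
	return len(seen)
}

// evictSpread models the worst case: it takes one pod from each job
// in turn until `need` GPUs are freed.
func evictSpread(jobs [][]Pod, need int) []Pod {
	var victims []Pod
	freed := 0
	for round := 0; freed < need; round++ {
		took := false
		for _, pods := range jobs {
			if round < len(pods) && freed < need {
				victims = append(victims, pods[round])
				freed += pods[round].GPU
				took = true
			}
		}
		if !took {
			break // nothing left to evict
		}
	}
	return victims
}

// evictWholeJobs evicts complete jobs, one at a time, until `need`
// GPUs are freed.
func evictWholeJobs(jobs [][]Pod, need int) []Pod {
	var victims []Pod
	freed := 0
	for _, pods := range jobs {
		if freed >= need {
			break
		}
		for _, p := range pods {
			victims = append(victims, p)
			freed += p.GPU
		}
	}
	return victims
}

// buildJobs creates n jobs with podsPerJob one-GPU pods each.
func buildJobs(n, podsPerJob int) [][]Pod {
	jobs := make([][]Pod, n)
	for i := range jobs {
		for j := 0; j < podsPerJob; j++ {
			jobs[i] = append(jobs[i], Pod{Job: fmt.Sprintf("job-%d", i+1), GPU: 1})
		}
	}
	return jobs
}

func main() {
	jobs := buildJobs(10, 10)
	fmt.Println("spread eviction, jobs affected:", affectedJobs(evictSpread(jobs, 10)))
	fmt.Println("gang eviction, jobs affected:", affectedJobs(evictWholeJobs(jobs, 10)))
}
```

Both strategies free the same 10 GPUs, but the spread strategy touches all 10 jobs while the whole-job strategy touches only one.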
Describe the solution you'd like
We want Volcano to remove (evict) as few jobs as possible when freeing up resources. Just like with gang scheduling—where all pods in a job are scheduled together—Volcano can evict an entire job together, as a “gang,” instead of evicting a few pods from many jobs.
This means:
Volcano will try to evict full jobs, not just single pods from different jobs. This way, fewer jobs are affected, and more jobs can keep running without being interrupted.
Additional context
We are using this feature with the RayCluster and PyTorchJob CRDs.
That's truly an issue; we can implement it as a new feature.
Thanks. Let me know your thoughts on this proposal.
Proposed change in reclaim.go
For a preemptor task PT1 in job PJ1, loop over the nodes. For each node, find all victim tasks (VTs):
- If sum(VT.resources) can fulfill PT1, PT1 can be pipelined on this node by reclaiming the victims:
  a. Pop a victim task victim1 and get its job name, job1.
  b. Evict job1 and record the reclaimed resources as map1 (e.g. node -> reclaimed_resources).
  c. If any node in map1 can fulfill PT1 after this eviction, the reclaim action can stop, even if other preemptors (PT2, PT3, ...) are waiting, and PT1 can be pipelined. Otherwise, go back to step a.
  d. By the time the loop exits for this node, PT1 must be pipelined.
- If sum(VT.resources) can NOT fulfill PT1, go to the next node.
Then, in the next scheduling loop, Volcano runs enqueue and allocate to see whether PJ1 can be scheduled.
As a result, each reclaim action reclaims for only one preemptor task PT. (Question: this might slow down the scheduler; is that a concern for you?)
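The steps above could be sketched roughly as follows. This is an illustrative Go model, not the actual reclaim.go code: `Task` and `reclaimForTask` are hypothetical, resources are reduced to a single GPU count, idle capacity on a node is counted alongside reclaimed capacity (an assumption the proposal does not spell out), and the preemptor is assumed not to fit anywhere before eviction.

```go
package main

import "fmt"

// Task is a hypothetical, simplified running task (not a Volcano type).
type Task struct {
	Name string
	Job  string
	Node string
	GPU  int
}

// reclaimForTask tries to free `req` GPUs on a single node for one
// preemptor task, evicting whole jobs rather than single tasks.
// It returns the node where the preemptor can be pipelined, the jobs
// evicted along the way, and whether it succeeded.
func reclaimForTask(req int, nodes []string, idle map[string]int, running []Task) (string, []string, bool) {
	evicted := map[string]bool{} // job name -> already evicted
	avail := map[string]int{}    // node -> idle + reclaimed GPUs
	for n, v := range idle {
		avail[n] = v
	}
	var evictedJobs []string

	for _, node := range nodes {
		// Collect the victims on this node and what they would free.
		var victims []Task
		sum := avail[node]
		for _, t := range running {
			if t.Node == node && !evicted[t.Job] {
				victims = append(victims, t)
				sum += t.GPU
			}
		}
		if sum < req {
			continue // victims here cannot fulfill the preemptor; next node
		}
		for len(victims) > 0 {
			// Step a: pop a victim and look up its job.
			victim := victims[0]
			victims = victims[1:]
			if evicted[victim.Job] {
				continue
			}
			// Step b: evict the whole job and record what each node regains.
			evicted[victim.Job] = true
			evictedJobs = append(evictedJobs, victim.Job)
			for _, t := range running {
				if t.Job == victim.Job {
					avail[t.Node] += t.GPU
				}
			}
			// Step c: stop as soon as any node can host the preemptor.
			for _, n := range nodes {
				if avail[n] >= req {
					return n, evictedJobs, true
				}
			}
		}
	}
	return "", evictedJobs, false
}

func main() {
	running := []Task{
		{"a-0", "jobA", "n1", 2},
		{"a-1", "jobA", "n1", 2},
		{"b-0", "jobB", "n1", 2},
		{"b-1", "jobB", "n2", 2},
	}
	node, jobs, ok := reclaimForTask(4, []string{"n1", "n2"}, map[string]int{}, running)
	fmt.Println(node, jobs, ok)
}
```

In this example, evicting jobA alone frees 4 GPUs on n1, so the loop stops before touching jobB.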
Why not evaluate a preemptor job against all victim jobs in the cluster
The alternative, evaluating a preemptor job against all victim jobs in the cluster within reclaim.go and deciding which set of victim jobs to evict, is complicated by node fragmentation. For example, if a job using 4 GPUs on node1 and 4 GPUs on node2 is evicted, these fragmented resources still cannot be assigned to a job that requires 8 GPUs on a single node.
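The fragmentation point can be illustrated with a small Go sketch. The helpers `totalFreed` and `fitsOnSingleNode` are hypothetical names used only for this example, and resources are reduced to a GPU count per node.

```go
package main

import "fmt"

// totalFreed sums the GPUs reclaimed across all nodes.
func totalFreed(freed map[string]int) int {
	total := 0
	for _, gpus := range freed {
		total += gpus
	}
	return total
}

// fitsOnSingleNode reports whether any one node regained at least
// `req` GPUs, which is what a single-node task actually needs.
func fitsOnSingleNode(freed map[string]int, req int) bool {
	for _, gpus := range freed {
		if gpus >= req {
			return true
		}
	}
	return false
}

func main() {
	// Evicting one job frees 4 GPUs on node1 and 4 GPUs on node2.
	freed := map[string]int{"node1": 4, "node2": 4}
	fmt.Println("enough GPUs cluster-wide:", totalFreed(freed) >= 8)
	fmt.Println("fits on a single node:", fitsOnSingleNode(freed, 8))
}
```

A cluster-wide total is therefore not a sufficient check; the decision has to be made per node, which is why the proposal evaluates victims node by node.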
@Monokaix I agree this is a genuine issue and could be resolved with a new feature. I have a preliminary implementation on a local branch. Would you be open to reviewing a PR? I’ll refine the code before opening it.
@mvinchoo Thanks for the implementation. Could you add me as well when the PR is ready?
@mtian29 @Monokaix Let me know what you think: https://github.com/volcano-sh/volcano/pull/4637 This is a lightweight solution, but there is a lot of room for future work. 😄
It's a good feature. I think we can discuss implementing it as a main feature in the next release (v1.14.0), since we also need it in our internal Volcano; job-level preemption is something we have always wanted to achieve. If you are interested and have time, we can set up a feature development team to implement it, @mtian29 @mvinchoo.
@JesseStutler
Yes, I’d be very interested in contributing. I’ve previously developed a custom_preempt action for my employer and have built up solid expertise in this part of the scheduler. I’d be glad to collaborate on implementing job-level preemption as a core feature.
Hi @mtian29 @JesseStutler @hajnalmt
I spent some time over the weekend expanding the original lightweight change into a full-fledged feature with multiple policies. size/M → size/XL.
When you have a moment, please take a look and let me know what you think.
@mvinchoo @JesseStutler Hello, this PR https://github.com/volcano-sh/volcano/pull/4637 seems to extend the preempt action with a gang option within a queue, but it does not support gang reclaim between queues. Is there any plan for this?
Hi @gqcn This PR is large and has not yet been approved by the Volcano reviewers. Once the whole team is on the same page for this feature and its implementation, it should be very easy to replicate this logic for the reclaim action.
We would love to have your feedback on how we can improve today’s gang aware preemption.
Currently, more and more users are expecting job-level preemption/reclamation. @lowang-bh has his own implementation, and @vzhou-p is also interested in participating in the development. We are also developing a prototype of this in our own Volcano product. We can unify the design using Google Docs or by discussing it in this issue. I will review #4637 when I have time; @lowang-bh, please take a look at #4637 as well.
BTW @mvinchoo @mtian29 @gqcn Do you mainly use WeChat or Slack?
Hi I don’t have a WeChat account. Slack would be best ([email protected])!
Both are fine with me. Is there a public slack channel I can join?
@mtian29 @mvinchoo The slack link is here: https://cloud-native.slack.com/?redir=%2Farchives%2FC011GJDQS0N%3Fname%3DC011GJDQS0N
@JesseStutler You would have to invite us to this Slack workspace
[email protected] doesn’t have an account on this workspace.
Hi, in our scenario I solved the gang reclaiming problem with Volcano Job policies, which abort all tasks of a job when any one of them is evicted. Like this:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-low
  annotations:
    volcano.sh/preemptable: "true"
spec:
  minAvailable: 2
  priorityClassName: pc-low
  schedulerName: volcano
  queue: queue-training
  policies:
    - event: TaskCompleted
      action: CompleteJob
    - event: PodEvicted
      action: AbortJob
  tasks:
    - replicas: 1
      name: worker1
      template:
        metadata:
          annotations:
            volcano.sh/card.name: NVIDIA-H200
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: nvidia.com/gpu.product
                        operator: In
                        values:
                          - NVIDIA-H200
          restartPolicy: Never
          containers:
            - image: alpine:latest
              imagePullPolicy: IfNotPresent
              name: worker
              command: ["sh", "-c", "sleep 1d"]
              resources:
                requests:
                  cpu: 100m
                  memory: 100Mi
                  nvidia.com/gpu: 4
                limits:
                  cpu: 100m
                  memory: 100Mi
                  nvidia.com/gpu: 4
    - replicas: 1
      name: worker2
      template:
        metadata:
          annotations:
            volcano.sh/card.name: NVIDIA-H200
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: nvidia.com/gpu.product
                        operator: In
                        values:
                          - NVIDIA-H200
          restartPolicy: Never
          containers:
            - image: alpine:latest
              imagePullPolicy: IfNotPresent
              name: worker
              command: ["sh", "-c", "sleep 1d"]
              resources:
                requests:
                  cpu: 100m
                  memory: 100Mi
                  nvidia.com/gpu: 4
                limits:
                  cpu: 100m
                  memory: 100Mi
                  nvidia.com/gpu: 4
It seems we do not really need a dedicated gang reclaiming feature at the moment, but I hope it would not be complicated to implement if it becomes necessary.
I mostly use WeChat; my ID is johnguo2023 if you'd like to chat.
@gqcn If I understand correctly, this policy approach only covers what happens after an eviction: if any task in a gang is evicted, the whole job is aborted. It doesn’t address how victims are chosen with respect to gangs, so tasks from multiple gangs can still be evicted. Ideally, our goal is to minimize the number of gangs impacted during preemption or reclaim.
@mvinchoo I invited you to join the Volcano Slack channel, though I don't know whether you have received the invitation: https://cloud-native.slack.com/archives/C011GJDQS0N. Could you create an account and join this channel? My Slack ID is Jesse Chen.
This is a very useful feature. I'd like to know what the progress is.