Volcano scheduler doesn’t reclaim/preempt jobs as a gang.

Open mtian29 opened this issue 6 months ago • 19 comments

What is the problem you're trying to solve

When reclaiming resources, Volcano checks every node in the cluster. For each node, it looks for pods that it can evict ("victims"). It keeps doing this node by node until it frees up enough resources to run the new task.

This method can cause problems for clusters running distributed AI training jobs.

For example: Suppose your cluster is running 10 distributed training jobs (Job 1 through Job 10), and each job has 10 pods. With the current way Volcano works, it might evict only 1 pod from each job to reclaim resources. This means all 10 jobs fail, since each job needs all its pods (“gang scheduling”).

But we’d prefer Volcano to evict all 10 pods from just one job (say, Job 1). That way, only Job 1 fails, and the other 9 jobs keep training.

TLDR: We want Volcano to fully evict a single job rather than partially evicting many jobs, which would cause more overall failures.
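To make the example concrete, here is a minimal sketch (not Volcano code) that compares the two eviction styles on the hypothetical cluster described above. The helper names (`evict_spread`, `evict_gang`, `jobs_hit`) and the one-resource-unit-per-pod model are assumptions made purely for illustration.

```python
from itertools import chain, zip_longest

def group_by_job(pods):
    """Group (job, pod) pairs by job name, preserving input order."""
    by_job = {}
    for job, pod in pods:
        by_job.setdefault(job, []).append((job, pod))
    return by_job

def evict_spread(pods, needed):
    """Worst case of per-pod victim selection: take one pod per job in turn."""
    by_job = group_by_job(pods)
    interleaved = chain.from_iterable(zip_longest(*by_job.values()))
    return [v for v in interleaved if v is not None][:needed]

def evict_gang(pods, needed):
    """Gang-style selection: evict whole jobs until enough pods are freed."""
    victims = []
    for members in group_by_job(pods).values():
        if len(victims) >= needed:
            break
        victims.extend(members)
    return victims

def jobs_hit(victims):
    """Number of distinct jobs that lose at least one pod."""
    return len({job for job, _ in victims})

# Hypothetical cluster: 10 jobs with 10 pods each; 10 pods must be freed.
pods = [(f"job{j}", f"job{j}-pod{p}") for j in range(1, 11) for p in range(10)]
print(jobs_hit(evict_spread(pods, 10)))  # 10 jobs broken
print(jobs_hit(evict_gang(pods, 10)))    # 1 job broken
```

Both strategies free the same number of pods; only the number of gangs that lose a member differs.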

Describe the solution you'd like

We want Volcano to remove (evict) as few jobs as possible when freeing up resources. Just like with gang scheduling—where all pods in a job are scheduled together—Volcano can evict an entire job together, as a “gang,” instead of evicting a few pods from many jobs.

This means:

Volcano will try to evict full jobs, not just single pods from different jobs. This way, fewer jobs are affected, and more jobs can keep running without being interrupted.

Additional context

We are using this feature with the RayCluster and PyTorchJob CRDs.

mtian29 avatar Sep 09 '25 16:09 mtian29

That's truly an issue; we can implement it as a new feature.

Monokaix avatar Sep 11 '25 09:09 Monokaix

Thanks. Let me know your thoughts on this proposal.

Proposed change in reclaim.go

For a preemptor task PT1 in a job PJ1:

Loop over the nodes.
For each node, find all victim tasks VTs:

  1. If sum(VT.resources) can fulfill PT1, we can pipeline PT1 on this node by reclaiming the victims:
    a. Pop a victim task victim1 and get its job name job1.
    b. Evict job1 and calculate the reclaimed resources as map1 (e.g. node -> reclaimed_resources).
    c. If any node in map1 can fulfill PT1 after this eviction, the reclaim action can stop, even though there are other preemptors (PT2, PT3, ...), and PT1 can be pipelined. Otherwise, go back to step a.
    d. In the end, when we quit the loop for this node, PT1 must be pipelined.

  2. If sum(VT.resources) can NOT fulfill PT1, go to the next node.

Then, in the next scheduling loop, Volcano runs enqueue and allocate to see whether PJ1 can be scheduled.

As a result, each reclaim action only reclaims resources for one preemptor task PT. (Question: this might slow down the scheduler; is that a concern for you?)
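The steps above can be sketched roughly as follows. This is a simplified model, not the actual reclaim.go logic: it tracks a single resource (GPU count), and the `Task` type, the `reclaim_for_task` function, and the choice to count a node's idle GPUs alongside the victims' GPUs are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    job: str
    node: str
    gpus: int

def reclaim_for_task(need, idle, running):
    """Return (node, evicted_jobs) for one preemptor task, or (None, [])."""
    for node in idle:
        victims = [t for t in running if t.node == node]
        # Step 2: even evicting every victim here is not enough; try the next node.
        if idle[node] + sum(t.gpus for t in victims) < need:
            continue
        reclaimed = {}             # map1: node -> GPUs reclaimed so far
        remaining = list(running)
        evicted = []
        while victims:
            job = victims.pop(0).job                  # step a
            for t in remaining:                       # step b: evict the whole job
                if t.job == job:
                    reclaimed[t.node] = reclaimed.get(t.node, 0) + t.gpus
            remaining = [t for t in remaining if t.job != job]
            victims = [t for t in victims if t.job != job]
            evicted.append(job)
            # Step c: stop as soon as any node touched by the evictions fits PT1.
            fit = next((n for n in reclaimed
                        if idle[n] + reclaimed[n] >= need), None)
            if fit is not None:
                return fit, evicted
    return None, []

# Hypothetical cluster: two 8-GPU nodes, fully occupied.
idle = {"n1": 0, "n2": 0}
running = [
    Task("a-0", "A", "n1", 4), Task("a-1", "A", "n2", 4),
    Task("b-0", "B", "n1", 4), Task("c-0", "C", "n2", 4),
]
print(reclaim_for_task(8, idle, running))  # ('n1', ['A', 'B'])
```

In this run, evicting job A frees only 4 GPUs on each node, so the loop goes back to step a and evicts job B as well before PT1 fits on n1. Job C is never touched.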

Why not evaluate a preemptor job against all victim jobs in the cluster

The alternative, evaluating a preemptor job against all victim jobs in the cluster within reclaim.go and then deciding which set of victim jobs to evict, is complicated by node fragmentation. For example, if a job using 4 GPUs on node1 and 4 GPUs on node2 is evicted, these fragmented resources still cannot be assigned to a job that requires 8 GPUs on a single node.
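A tiny sketch of the fragmentation point, using the numbers from the example. The function name `check_fit` is hypothetical; it only contrasts a cluster-wide total with a per-node check.

```python
def check_fit(need, freed_by_node):
    """Return (fits_by_cluster_total, fits_on_a_single_node)."""
    total = sum(freed_by_node.values())
    largest = max(freed_by_node.values(), default=0)
    return total >= need, largest >= need

# Evicting the victim job frees 4 GPUs on each of two nodes.
freed = {"node1": 4, "node2": 4}
print(check_fit(8, freed))  # (True, False)
```

A cluster-wide comparison says the eviction is sufficient, while the per-node check shows that the 8-GPU task still cannot be placed.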

mtian29 avatar Sep 12 '25 00:09 mtian29

@Monokaix I agree this is a genuine issue and could be resolved with a new feature. I have a preliminary implementation on a local branch. Would you be open to reviewing a PR? I’ll refine the code before opening it.

mvinchoo avatar Sep 19 '25 19:09 mvinchoo

@mvinchoo Thanks for the implementation. Could you add me as well when the PR is ready?

mtian29 avatar Sep 19 '25 19:09 mtian29

@mtian29 @Monokaix Let me know what you think: https://github.com/volcano-sh/volcano/pull/4637 This is a lightweight solution, but there is a lot of room for future work. 😄

mvinchoo avatar Sep 24 '25 06:09 mvinchoo

It's a good feature. I think we can discuss implementing it as a main feature in the next release (v1.14.0), as we also need it in our internal Volcano; job-level preemption is what we have always wanted to achieve. We can set up a feature development team to implement it if you have the interest and time, @mtian29 @mvinchoo.

JesseStutler avatar Sep 26 '25 07:09 JesseStutler

@JesseStutler Yes, I’d be very interested in contributing. I’ve previously developed a custom_preempt action for my employer and have built up solid expertise in this part of the scheduler. I’d be glad to collaborate on implementing job-level preemption as a core feature.

mvinchoo avatar Sep 26 '25 07:09 mvinchoo

Hi @mtian29 @JesseStutler @hajnalmt, I spent some time over the weekend expanding the original lightweight change into a full-fledged feature with multiple policies. When you have a moment, please take a look and let me know what you think.

mvinchoo avatar Oct 06 '25 02:10 mvinchoo

@mvinchoo @JesseStutler Hello, this PR https://github.com/volcano-sh/volcano/pull/4637 seems to extend the preempt action with a gang option within a queue, but it does not support gang reclaim between queues. Is there any plan for this?

gqcn avatar Nov 11 '25 11:11 gqcn

Hi @gqcn, this PR is large and has not yet been approved by the Volcano reviewers. Once the whole team is on the same page about this feature and its implementation, it should be very easy to replicate this logic for the reclaim action.

We would love to have your feedback on how we can improve today’s gang-aware preemption.

mvinchoo avatar Nov 11 '25 19:11 mvinchoo

Currently, more and more users are expecting job-level preemption/reclamation. @lowang-bh has his own implementation, and @vzhou-p is also interested in participating in the development. We are also developing a prototype of this in our own Volcano product. We can unify the design using Google Docs or by discussing it in this issue. I will review #4637 when I have time; @lowang-bh, please take a look at #4637 as well.

BTW @mvinchoo @mtian29 @gqcn, do you mainly use WeChat or Slack?

JesseStutler avatar Nov 20 '25 01:11 JesseStutler

Hi I don’t have a WeChat account. Slack would be best ([email protected])!

mvinchoo avatar Nov 20 '25 02:11 mvinchoo

Both are fine with me. Is there a public Slack channel I can join?

mtian29 avatar Nov 20 '25 19:11 mtian29

@mtian29 @mvinchoo The Slack link is here: https://cloud-native.slack.com/?redir=%2Farchives%2FC011GJDQS0N%3Fname%3DC011GJDQS0N

JesseStutler avatar Nov 21 '25 01:11 JesseStutler

@JesseStutler You would have to invite us to this Slack workspace.

[email protected] doesn’t have an account on this workspace.

mvinchoo avatar Nov 21 '25 05:11 mvinchoo

Hi, in our scenario I solved the gang reclaiming problem using the policies of the Volcano job, which abort all tasks of a job when one of them is evicted. Like this:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-low
  annotations:
    volcano.sh/preemptable: "true"
spec:
  minAvailable: 2
  priorityClassName: pc-low
  schedulerName: volcano
  queue: queue-training
  policies:
  - event: TaskCompleted
    action: CompleteJob
  - event: PodEvicted
    action: AbortJob 
  tasks:
  - replicas: 1
    name: worker1
    template:
      metadata:
        annotations:
          volcano.sh/card.name: NVIDIA-H200
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nvidia.com/gpu.product
                  operator: In
                  values:
                  - NVIDIA-H200
        restartPolicy: Never
        containers:
        - image: alpine:latest
          imagePullPolicy: IfNotPresent
          name: worker
          command: ["sh", "-c", "sleep 1d"]
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
              nvidia.com/gpu: 4
            limits:
              cpu: 100m
              memory: 100Mi
              nvidia.com/gpu: 4
  - replicas: 1
    name: worker2
    template:
      metadata:
        annotations:
          volcano.sh/card.name: NVIDIA-H200
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nvidia.com/gpu.product
                  operator: In
                  values:
                  - NVIDIA-H200
        restartPolicy: Never
        containers:
        - image: alpine:latest
          imagePullPolicy: IfNotPresent
          name: worker
          command: ["sh", "-c", "sleep 1d"]
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
              nvidia.com/gpu: 4
            limits:
              cpu: 100m
              memory: 100Mi
              nvidia.com/gpu: 4

It seems we do not really need a gang reclaiming feature for now, but I hope it would not be complicated to implement if it becomes necessary.

I mostly use WeChat; my ID is johnguo2023 if you would like to chat.

gqcn avatar Nov 25 '25 01:11 gqcn

@gqcn If I understand correctly, this policy approach only addresses the case where, if any task in a gang is evicted, the whole job fails. It doesn’t solve the problem of how to choose victims with respect to gangs, so it’s still possible for tasks from multiple gangs to be evicted. Ideally, our goal is to minimize the number of gangs impacted during preemption or reclaim.

vzhou-p avatar Nov 25 '25 22:11 vzhou-p

@mvinchoo I invited you to join the Volcano Slack channel, but I don't know whether you have received the invitation: https://cloud-native.slack.com/archives/C011GJDQS0N. Could you create an account and join this channel? My Slack ID is Jesse Chen.

JesseStutler avatar Nov 28 '25 08:11 JesseStutler

This is a very useful feature. I'd like to know what the progress is.

zhaizhicheng avatar Dec 03 '25 06:12 zhaizhicheng