Feature: Enhance Network Topology Aware Scheduling with HyperNode-Level Resource Usage Aware Binpacking
What is the problem you're trying to solve
The current Network Topology Aware plugin optimizes placement for RDMA network topologies (e.g., Fat Tree) in distributed jobs by scheduling pods onto topologically close nodes. While Volcano already supports intra-job HyperNode binpacking (packing the tasks of a single job into compact HyperNodes), it lacks HyperNode-level resource usage aware binpacking across hierarchical units (e.g., Leaf and Spine switches).
This causes suboptimal scheduling: small jobs scattered across HyperNodes fragment capacity and prevent the placement of future large jobs.
Describe the solution you'd like
I would like to see additional configuration options added to the existing plugin framework, similar to the Volcano Binpack plugin, to allow users to enable and customize the HyperNode-Level Resource Usage Aware Binpack strategy.
The difference from Volcano's existing Binpack plugin is that the binpack feature in the Network Topology Aware plugin focuses on the HyperNode dimension at each layer. For example:
graph TD
Root("Spine:<br> Capacity={256 CPU, 512Gi Memory, 64 GPU}<br>Idle={192 CPU, 384Gi Memory, 48 GPU}")
Root --> Leaf1["Leaf 1:<br> Capacity={128 CPU, 256Gi Memory, 32 GPU}<br>Idle={64 CPU, 128Gi Memory, 16 GPU}"]
Root --> Leaf2["Leaf 2: <br>Capacity={128 CPU, 256Gi Memory, 32 GPU}<br>Idle={128 CPU, 256Gi Memory, 32 GPU}"]
Leaf1 --> Node1["Node 1:<br> Capacity={64 CPU, 128Gi Memory, 16 GPU}<br>Idle={0 CPU, 0Gi Memory, 0 GPU}"]
Leaf1 --> Node2["Node 2:<br> Capacity={64 CPU, 128Gi Memory, 16 GPU}<br>Idle={64 CPU, 128Gi Memory, 16 GPU}"]
Leaf2 --> Node3["Node 3:<br> Capacity={64 CPU, 128Gi Memory, 16 GPU}<br>Idle={64 CPU, 128Gi Memory, 16 GPU}"]
Leaf2 --> Node4["Node 4:<br> Capacity={64 CPU, 128Gi Memory, 16 GPU}<br>Idle={64 CPU, 128Gi Memory, 16 GPU}"]
classDef red fill:#ffe6e6;
class Node1 red;
This should include parameters such as binpack.enabled, binpack.cpu, binpack.memory, and binpack.resources (with support for custom resources like nvidia.com/gpu).
- name: network-topology-aware
  arguments:
    weight: 10
    binpack.enabled: true
    binpack.cpu: 1
    binpack.memory: 2
    binpack.resources: nvidia.com/gpu
    binpack.resources.nvidia.com/gpu: 10
When HyperNode-level resource usage aware binpacking is enabled, the scheduler should perform a global search for the optimal placement. To support this behavior, users should be able to set percentage-nodes-to-find=100 (or a similar parameter) so that the scheduler evaluates all available nodes and selects the best one based on the configured binpack weights and thresholds.
This plan should work in both hard and soft modes.
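To make the intended scoring concrete, here is a minimal sketch of how a per-HyperNode binpack score could be computed from these weights. It mirrors the formula of the node-level Binpack plugin but applies it to a HyperNode's aggregated used/idle resources; the function name and its inputs are illustrative, not the plugin's actual API:

```go
// binpackScore sketches a HyperNode-level binpack score: for every weighted
// resource it computes (used + requested) / capacity, so HyperNodes that end
// up fuller after the placement score higher and new tasks pack into them.
func binpackScore(requested, used, capacity, weights map[string]float64) float64 {
	var weighted, totalWeight float64
	for res, w := range weights {
		total, ok := capacity[res]
		if !ok || total <= 0 || w <= 0 {
			continue
		}
		occupied := used[res] + requested[res]
		if occupied > total {
			// The request does not fit into this HyperNode on this resource.
			return 0
		}
		weighted += w * (occupied / total)
		totalWeight += w
	}
	if totalWeight == 0 {
		return 0
	}
	// Normalize to 0-100, like the node-level Binpack plugin does.
	return weighted / totalWeight * 100
}
```

With the weights above (cpu=1, memory=2, nvidia.com/gpu=10) and a task requesting one full node (64 CPU, 128Gi memory, 16 GPU), Leaf 1 in the diagram would score 100 (it becomes fully occupied) while Leaf 2 would score 50, so the task is packed into Leaf 1.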
Additional context
Concrete scenarios:
- Scenario 1: After placing an 8-node job in Leaf-1 (capacity=16), a new 8-node job should pack into Leaf-1 instead of Leaf-2.
- Scenario 2: After placing an 8-node job in Leaf-1, a new 12-node job must use Leaf-2 to avoid topology violations.
Initial state (before either scenario):
graph TD
Root("Spine")
Root --> Leaf1["Leaf 1: Capacity=16<br>(8 used, 8 free)"]
Root --> Leaf2["Leaf 2: Capacity=16<br>(16 free)"]
Leaf1 --> Job1["Existing 8-node Job"]
Scenario 1: Packing (8-node job into Leaf-1)
graph TD
Root("Spine")
Root --> Leaf1["Leaf 1: Capacity=16<br>(8 used, 8 free)"]
Root --> Leaf2["Leaf 2: Capacity=16<br>(16 free)"]
Leaf1 --> Job1["Existing 8-node Job"]
Leaf1 --> Job2["New 8-node Job (PACKED)"]
classDef red fill:#ffe6e6,stroke:#ff0000;
class Job2 red;
Desired: New job packs into Leaf-1 to avoid fragmenting Leaf-2.
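For example, using the illustrative score sketch above with a single hypothetical "nodes" resource at unit weight, Leaf-1 would score (8 used + 8 requested) / 16 = 1.0 while Leaf-2 would score (0 + 8) / 16 = 0.5, so the new job is packed into Leaf-1.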
Scenario 2: Prevent Topology Violations (12-node job into Leaf-2)
graph TD
Root("Spine")
Root --> Leaf1["Leaf 1: Capacity=16<br>(8 used, 8 free)"]
Root --> Leaf2["Leaf 2: Capacity=16<br>(12 used, 4 free)"]
Leaf1 --> Job1["Existing 8-node Job"]
Leaf2 --> Job2["New 12-node Job"]
classDef red fill:#ffe6e6,stroke:#ff0000;
class Job2 red;
Desired: New 12-node job uses Leaf-2 to prevent topology violations in Leaf-1.
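In this case the 12-node job cannot fit into Leaf-1's 8 free nodes, so placing it there would force it to span both leaves; under the illustrative sketch above Leaf-1 is filtered out (scored 0) while Leaf-2 scores (0 + 12) / 16 = 0.75, so the job lands entirely in Leaf-2.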
/assign
Very useful feature! Thanks! 👍
Great!
I have an idea: If a job does not specify a HyperNode policy, it suggests that the job is not sensitive to communication performance. Such jobs could then be treated as FillJobs and used to fill HyperNode fragments, which aligns with the problem described in this issue. Conversely, if a job does specify a HyperNode policy, it should follow the existing scheduling flow based on its soft or hard strategy.
Let me illustrate with an example:
Imagine we have four HyperNodes, each with four nodes:
- HyperNode1: 1 node remaining
- HyperNode2: 1 node remaining
- HyperNode3: 4 nodes remaining
- HyperNode4: 4 nodes remaining
A user submits a 2-node job (JobA) and an 8-node job (JobB). If JobA is insensitive to communication performance, the ideal scenario would be for JobA to fill the fragments in HyperNode1 and HyperNode2, while JobB utilizes HyperNode3 and HyperNode4.
If the user specifies JobA in hard mode, JobA will occupy nodes within HyperNode3 or HyperNode4. This could lead to JobB being unschedulable or performing poorly due to spanning multiple tiers.
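A rough sketch of this FillJob ordering for a single task of a policy-less job, purely illustrative (the function name and the simplified free-node counts are assumptions, not existing Volcano code):

```go
// pickHyperNodeForFillTask sketches the FillJob idea for one task of a job
// without a topology policy: choose the HyperNode with the fewest free nodes
// that can still take a task, so fragments (HyperNode1/2 above) are consumed
// before the large empty HyperNodes (3/4) that JobB needs.
func pickHyperNodeForFillTask(freeNodes map[string]int) (string, bool) {
	best, bestFree := "", 0
	for hn, free := range freeNodes {
		if free == 0 {
			continue // nothing left in this HyperNode
		}
		if bestFree == 0 || free < bestFree {
			best, bestFree = hn, free
		}
	}
	return best, bestFree != 0
}
```

Jobs that do specify a soft or hard policy would bypass this path and keep the existing topology-aware flow.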
@MondayCha @JesseStutler
@MondayCha Hi, what's the progress of this issue? Did you finish the PR?
> I have an idea: If a job does not specify a HyperNode policy, it suggests that the job is not sensitive to communication performance. Such jobs could then be treated as FillJobs, used to fill HyperNode fragments. […]
Do you mean that even if a job does not specify a network topology policy, it should still be placed with HyperNode-level binpack? @kingeasternsun
> @MondayCha Hi, what's the progress of this issue? Did you finish the PR?
I'm sorry, but it's not finished yet; if someone else is working on it, I can cancel the assignment.
> I'm sorry, but it's not finished yet; if someone else is working on it, I can cancel the assignment.
If you already work on this, you can keep finishing it, but I hope we can catch up before 9.30
> I have an idea: If a job does not specify a HyperNode policy, it suggests that the job is not sensitive to communication performance. Such jobs could then be treated as FillJobs, used to fill HyperNode fragments. […]
I submitted my initial implementation, but I did not take this idea into account in it.
In our practice, we ensure that at least a soft topology mode is applied to all jobs, so single or small-scale jobs are used to fill the fragments of HyperNodes.
If an existing job happens to occupy a portion of the nodes in Leaf 1, then under node-level binpack the remaining nodes in Leaf 1 and the nodes in Leaf 2 receive the same score, so the new job may be scheduled to Leaf 2. To avoid this, we need to perform another binpack at the HyperNode level.
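Concretely, in the first diagram above, node-level binpack alone gives the empty Node 2 (in Leaf 1) and the empty Nodes 3 and 4 (in Leaf 2) identical scores, so only an additional HyperNode-level term can break the tie in favor of the partially used Leaf 1.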
> FillJobs

The "FillJobs" idea you mention is a separate question; we can improve it in another PR if anyone is available.
> If you already work on this, you can keep finishing it, but I hope we can catch up before 9.30
@JesseStutler Hello, I've submitted a pull request implementing this feature. Could you please review it when you have a moment? #4612

