
Network-topology-aware scheduling optimization: node reordering for tasks.

Open linuxfhy opened this issue 11 months ago • 4 comments

What is the problem you're trying to solve

Background: In AI training Collective Communication scenarios, each pod has a rank number; in many cases, pods will communicate in a ring based on their rank numbers.

One example:

1. The task includes 5 pods, p1–p5, with data flowing in the direction p1->p2->p3->p4->p5->p1.
2. There are 5 available nodes in the cluster, distributed across two HyperNodes.

If pods fill one HyperNode in rank order before occupying the next, the number of cross-HyperNode communications is 2.

(Image: pods bound in rank order, filling one HyperNode before the next)

If pods are randomly bound to available nodes, there will be more cross-HyperNode communications, which reduces training efficiency.

(Image: pods bound randomly, incurring more cross-HyperNode hops)

Describe the solution you'd like

Rearrange the tasks to fill one HyperNode first before occupying the next HyperNode.
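A minimal sketch of this packing idea (the names `packNodes` and the map layout are illustrative, not Volcano API): group nodes by their leaf HyperNode, then emit them HyperNode by HyperNode, so that pod rank i maps to the i-th node and consecutive ranks land on the same HyperNode whenever possible.

```go
package main

import (
	"fmt"
	"sort"
)

// packNodes is a hypothetical helper: it fills one HyperNode completely
// before moving to the next, returning nodes in rank-assignment order.
func packNodes(hyperNodes map[string][]string) []string {
	// Sort HyperNode names for a deterministic fill order.
	names := make([]string, 0, len(hyperNodes))
	for name := range hyperNodes {
		names = append(names, name)
	}
	sort.Strings(names)

	var ordered []string
	for _, name := range names {
		nodes := append([]string(nil), hyperNodes[name]...)
		sort.Strings(nodes) // stable order inside a HyperNode
		ordered = append(ordered, nodes...)
	}
	return ordered
}

func main() {
	hyperNodes := map[string][]string{
		"s1": {"node1", "node2", "node3"},
		"s2": {"node4", "node5"},
	}
	// Pod with rank i is bound to ordered[i]: p1-p3 stay on s1, p4-p5 on s2,
	// so the ring p1->...->p5->p1 crosses HyperNodes only twice.
	for rank, node := range packNodes(hyperNodes) {
		fmt.Printf("p%d -> %s\n", rank+1, node)
	}
}
```

With the 5-pod example above, this yields exactly the 2-crossing placement described in the issue.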

Additional context

No response

linuxfhy avatar Apr 25 '25 01:04 linuxfhy

I think it's a good feature. We might specify that pods with direct dependencies should be placed in the same HyperNode as much as possible.

JesseStutler avatar Apr 25 '25 06:04 JesseStutler

There is another way to implement this: add a controller that updates the RANK (whether in the pod or in a ConfigMap) after pods are scheduled. Then there is no need to do anything extra in the scheduling process; because every pod has the same podTemplate, a RANK can be assigned after the pod is scheduled.
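A rough sketch of this post-scheduling idea (the names `assignRanks` and `nodeOrder` are illustrative, not Volcano API): once pods are bound, a controller can derive each pod's RANK from its node's position in a topology-aware node order and publish it, e.g. via an annotation or a ConfigMap.

```go
package main

import (
	"fmt"
	"sort"
)

// assignRanks is a hypothetical post-scheduling helper: given the
// scheduler's pod->node bindings and a topology-aware node order, it
// reassigns ranks so neighboring ranks sit on adjacent nodes.
func assignRanks(podToNode map[string]string, nodeOrder []string) map[string]int {
	pos := make(map[string]int, len(nodeOrder))
	for i, n := range nodeOrder {
		pos[n] = i
	}
	pods := make([]string, 0, len(podToNode))
	for p := range podToNode {
		pods = append(pods, p)
	}
	// Order pods by their node's topology position (ties broken by name).
	sort.Slice(pods, func(i, j int) bool {
		pi, pj := pos[podToNode[pods[i]]], pos[podToNode[pods[j]]]
		if pi != pj {
			return pi < pj
		}
		return pods[i] < pods[j]
	})
	ranks := make(map[string]int, len(pods))
	for rank, p := range pods {
		ranks[p] = rank
	}
	return ranks
}

func main() {
	// The scheduler bound pods arbitrarily; ranks are fixed up afterwards.
	podToNode := map[string]string{"pa": "node4", "pb": "node1", "pc": "node2"}
	nodeOrder := []string{"node1", "node2", "node3", "node4", "node5"}
	fmt.Println(assignRanks(podToNode, nodeOrder))
}
```

The appeal of this approach is that the scheduler itself stays unchanged; only the rank assignment moves to after binding.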

Monokaix avatar May 16 '25 09:05 Monokaix

We have nearly identical scheduling requirements in our production environment. Our internal scheduler already implements similar functionality, operating under the following key assumption:

Nodes within different LEAF HyperNodes reside in distinct IP ranges, while Nodes within the same LEAF HyperNode are in the same, contiguous IP range.

For example, as illustrated above:

  • Nodes under s1: node1 has IP 10.116.200.150, node2 has 10.116.200.151, and node3 has 10.116.200.153.
  • Nodes under s2: node4 has IP 10.116.200.200, and node5 has 10.116.200.201.

If this underlying assumption is acceptable to the community, I can submit our implementation as a Pull Request.

Our implementation only requires extending the Network Topology Aware plugin and doesn't require any changes to user tasks.
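A minimal sketch of how the IP-range assumption could be exploited (the name `sortByIP` is illustrative, not the actual implementation): if each LEAF HyperNode owns a distinct contiguous IPv4 range, sorting nodes by their numeric IP groups same-HyperNode nodes together with no explicit topology input.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// sortByIP is a hypothetical helper: it orders nodes by their numeric
// IPv4 address, so that nodes sharing a LEAF HyperNode's contiguous
// range become adjacent. Assumes all values parse as valid IPv4.
func sortByIP(nodeIPs map[string]string) []string {
	nodes := make([]string, 0, len(nodeIPs))
	for n := range nodeIPs {
		nodes = append(nodes, n)
	}
	sort.Slice(nodes, func(i, j int) bool {
		a := net.ParseIP(nodeIPs[nodes[i]]).To4()
		b := net.ParseIP(nodeIPs[nodes[j]]).To4()
		for k := 0; k < 4; k++ {
			if a[k] != b[k] {
				return a[k] < b[k]
			}
		}
		return false
	})
	return nodes
}

func main() {
	// IPs from the example above: s1 owns 10.116.200.150-153, s2 owns .200-201.
	nodeIPs := map[string]string{
		"node4": "10.116.200.200",
		"node1": "10.116.200.150",
		"node5": "10.116.200.201",
		"node2": "10.116.200.151",
		"node3": "10.116.200.153",
	}
	fmt.Println(sortByIP(nodeIPs)) // s1 nodes precede s2 nodes
}
```

Filling the sorted list in rank order then packs one HyperNode before the next, matching the behavior described in the original request.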

kingeasternsun avatar May 23 '25 01:05 kingeasternsun

It's valuable!

Monokaix avatar May 23 '25 01:05 Monokaix

In fact, network topology aware scheduling is generally based on the RDMA network topology. Maybe there is no connection with the front-end network IPs?

We need this feature too.

yccharles avatar Jul 23 '25 13:07 yccharles