Network-topology-aware scheduling optimization: node reordering for tasks.
What is the problem you're trying to solve
Background: In AI training collective-communication scenarios, each pod has a rank number; in many cases, pods communicate in a ring ordered by their ranks.
One example:
1. The task includes 5 pods, p1–p5, with data flowing in the direction p1->p2->p3->p4->p5->p1.
2. There are 5 available nodes in the cluster, distributed across two HyperNodes.
If pods occupy one HyperNode in rank order before occupying the next, the number of cross-HyperNode communications is 2.
If pods are instead bound to available nodes at random, there will be more cross-HyperNode communication, which reduces training efficiency.
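The effect described above can be made concrete with a small sketch. The helper `countCrossHyperNode` and the `hyperNodeOf` mapping below are hypothetical names for illustration, assuming the node1–node3/s1 and node4–node5/s2 layout from the example:

```go
package main

import "fmt"

// countCrossHyperNode counts ring edges (rank i -> rank i+1, wrapping around)
// whose endpoints land on different HyperNodes. placement[i] is the node bound
// to the pod with rank i; hyperNodeOf maps each node to its HyperNode.
func countCrossHyperNode(placement []string, hyperNodeOf map[string]string) int {
	cross := 0
	n := len(placement)
	for i := 0; i < n; i++ {
		if hyperNodeOf[placement[i]] != hyperNodeOf[placement[(i+1)%n]] {
			cross++
		}
	}
	return cross
}

func main() {
	hyperNodeOf := map[string]string{
		"node1": "s1", "node2": "s1", "node3": "s1",
		"node4": "s2", "node5": "s2",
	}
	// Ranks fill s1 first, then s2: only the p3->p4 and p5->p1 edges cross.
	sequential := []string{"node1", "node2", "node3", "node4", "node5"}
	fmt.Println(countCrossHyperNode(sequential, hyperNodeOf)) // 2

	// An interleaved binding crosses HyperNodes on four of the five edges.
	interleaved := []string{"node1", "node4", "node2", "node5", "node3"}
	fmt.Println(countCrossHyperNode(interleaved, hyperNodeOf)) // 4
}
```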
Describe the solution you'd like
Rearrange the tasks so that they fill one HyperNode completely before occupying the next HyperNode.
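One way to realize this is to reorder the candidate node list so that nodes of one HyperNode come before any node of the next; rank-ordered pods bound in that order then fill a HyperNode before spilling over. This is only a sketch, not the actual plugin logic; `orderNodesByHyperNode` and `hyperNodeOf` are hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// orderNodesByHyperNode groups candidate nodes by HyperNode and concatenates
// the groups, so pods assigned in rank order occupy one HyperNode fully
// before the next. Larger HyperNodes come first to minimize spillover;
// ties break by name for determinism.
func orderNodesByHyperNode(nodes []string, hyperNodeOf map[string]string) []string {
	groups := map[string][]string{}
	for _, n := range nodes {
		groups[hyperNodeOf[n]] = append(groups[hyperNodeOf[n]], n)
	}
	hns := make([]string, 0, len(groups))
	for hn := range groups {
		hns = append(hns, hn)
	}
	sort.Slice(hns, func(i, j int) bool {
		if len(groups[hns[i]]) != len(groups[hns[j]]) {
			return len(groups[hns[i]]) > len(groups[hns[j]])
		}
		return hns[i] < hns[j]
	})
	ordered := make([]string, 0, len(nodes))
	for _, hn := range hns {
		sort.Strings(groups[hn])
		ordered = append(ordered, groups[hn]...)
	}
	return ordered
}

func main() {
	hyperNodeOf := map[string]string{
		"node1": "s1", "node2": "s1", "node3": "s1",
		"node4": "s2", "node5": "s2",
	}
	nodes := []string{"node4", "node1", "node5", "node3", "node2"}
	fmt.Println(orderNodesByHyperNode(nodes, hyperNodeOf))
	// [node1 node2 node3 node4 node5]
}
```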
I think it's a good feature. We might specify that pods with direct dependencies should be placed under the same HyperNode as much as possible.
There is another way to implement this: add a controller that updates the RANK (whether in the pod or a ConfigMap) after pods are scheduled. Then nothing extra is needed in the scheduling process; because every pod has the same podTemplate, a RANK can be assigned after the pod is scheduled.
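The post-scheduling rank assignment suggested above can be sketched as follows. This is a minimal illustration, not a controller implementation; `assignRanks`, `podNode`, and `hyperNodeOf` are hypothetical names, and writing the result back into pod envs or a ConfigMap is left out:

```go
package main

import (
	"fmt"
	"sort"
)

// assignRanks computes a rank for each already-scheduled pod so that pods
// bound to nodes in the same HyperNode receive contiguous ranks. A controller
// could then write the rank into the pod or a ConfigMap. podNode maps pod
// name to its bound node; hyperNodeOf maps node to HyperNode.
func assignRanks(podNode map[string]string, hyperNodeOf map[string]string) map[string]int {
	pods := make([]string, 0, len(podNode))
	for p := range podNode {
		pods = append(pods, p)
	}
	// Sort by (HyperNode, node, pod name): same-HyperNode pods become adjacent.
	sort.Slice(pods, func(i, j int) bool {
		hi, hj := hyperNodeOf[podNode[pods[i]]], hyperNodeOf[podNode[pods[j]]]
		if hi != hj {
			return hi < hj
		}
		if podNode[pods[i]] != podNode[pods[j]] {
			return podNode[pods[i]] < podNode[pods[j]]
		}
		return pods[i] < pods[j]
	})
	ranks := make(map[string]int, len(pods))
	for r, p := range pods {
		ranks[p] = r
	}
	return ranks
}

func main() {
	podNode := map[string]string{
		"p1": "node1", "p2": "node4", "p3": "node2", "p4": "node5", "p5": "node3",
	}
	hyperNodeOf := map[string]string{
		"node1": "s1", "node2": "s1", "node3": "s1",
		"node4": "s2", "node5": "s2",
	}
	ranks := assignRanks(podNode, hyperNodeOf)
	fmt.Println(ranks["p1"], ranks["p3"], ranks["p5"], ranks["p2"], ranks["p4"]) // 0 1 2 3 4
}
```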
We have nearly identical scheduling requirements in our production environment. Our internal scheduler already implements similar functionality, operating under the following key assumption:
Nodes within different LEAF HyperNodes reside in distinct IP ranges, while Nodes within the same LEAF HyperNode are in the same, contiguous IP range.
For example, as illustrated above:
- Nodes under s1: node1 has IP 10.116.200.150, node2 has 10.116.200.151, and node3 has 10.116.200.153.
- Nodes under s2: node4 has IP 10.116.200.200, and node5 has 10.116.200.201.
If this underlying assumption is acceptable to the community, I can submit our implementation as a Pull Request.
Our implementation only requires extending the Network Topology Aware plugin and needs no changes to user tasks.
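Under the stated IP-range assumption, grouping nodes by LEAF HyperNode reduces to sorting them by IP address: nodes in the same contiguous range become adjacent without explicit topology labels. A minimal sketch (the `sortNodesByIP` helper is a hypothetical name, using the example addresses above):

```go
package main

import (
	"bytes"
	"fmt"
	"net"
	"sort"
)

// sortNodesByIP orders nodes by their IPv4 address. Assuming nodes in the
// same LEAF HyperNode occupy one contiguous IP range distinct from other
// HyperNodes' ranges, same-HyperNode nodes end up adjacent in the result.
func sortNodesByIP(nodeIP map[string]string) []string {
	nodes := make([]string, 0, len(nodeIP))
	for n := range nodeIP {
		nodes = append(nodes, n)
	}
	sort.Slice(nodes, func(i, j int) bool {
		a := net.ParseIP(nodeIP[nodes[i]]).To4()
		b := net.ParseIP(nodeIP[nodes[j]]).To4()
		return bytes.Compare(a, b) < 0
	})
	return nodes
}

func main() {
	nodeIP := map[string]string{
		"node3": "10.116.200.153", // s1
		"node5": "10.116.200.201", // s2
		"node1": "10.116.200.150", // s1
		"node4": "10.116.200.200", // s2
		"node2": "10.116.200.151", // s1
	}
	fmt.Println(sortNodesByIP(nodeIP))
	// [node1 node2 node3 node4 node5]
}
```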
It's valuable!
In fact, network-topology-aware scheduling is generally based on the RDMA network topology. It may have no connection to the front-end network IPs?
We need this feature too.