Network Topology Aware Plugin
/kind feature
fixes #2984 fixes #447 fixes #3317
There are several issues requesting this feature, such as #447, #2984, and #3317.
Motivation
We aim to make the scheduler network-topology aware so as to achieve the following:
- Best effort to schedule all tasks of the same job onto devices in the same topology domain, such as the same IDC.
Goals
- Support a single-key topology configuration: try to schedule all of a job's tasks to nodes that share the same value for that key.
- Support multiple-key topology policies, where keys listed earlier receive a higher score (see the configuration sketch below).
Non-Goals
- Finding a globally optimal solution among nodes across all values of that key.
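For illustration, a minimal sketch of how these goals might surface in the volcano-scheduler configmap; the plugin name `network-topology` and its `network-topology.keys` argument are hypothetical, not a released API:

```yaml
# Sketch only: plugin name and argument key are hypothetical.
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: gang
  - name: network-topology
    arguments:
      # Ordered list: earlier keys contribute a higher score when all of a
      # job's tasks can land on nodes sharing the same label value.
      network-topology.keys: "topology.volcano.sh/idc,topology.volcano.sh/switch"
```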
/assign @Monokaix @hwdef @william-wang
We should discuss this in the weekly meeting.
Sounds interesting! But it may be more complex than the given design. For example, network delay varies between different nodes, and it also varies over time for the same node. Considering network performance metrics for this feature may be a good choice. I think we should have a discussion in the community and complete the design first.
What's the difference between this and https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?
Considering network performance metrics for this feature may be a good choice.
I would recommend treating network performance metrics as a kind of load, so this is closer to load-aware scheduling.
network delay varies between different nodes
That is not considered for now. This plugin only considers the physical differences in topology.
https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?
I think it is a lite version of that plugin: it only considers a few levels of physical topology, such as IDC, rack, and switch, and depends on those labels being present on nodes.
Advantage: simpler to use, relying only on node labels. Shortcoming: no bandwidth or latency awareness, etc.
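For illustration, the node labels this approach relies on might look like the following; the label keys here are hypothetical:

```yaml
# Hypothetical topology labels; an admin or agent would maintain these.
apiVersion: v1
kind: Node
metadata:
  name: node-001
  labels:
    topology.volcano.sh/idc: idc-a
    topology.volcano.sh/rack: rack-3
    topology.volcano.sh/switch: leaf-12
```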
We should collect more use cases :)
We have some thoughts on network topology aware scheduling as well, but limited to within the same IDC.
User Story
Say we have a 10,000-GPU cluster where each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches, and there might be multiple paths between leaf switches. The network topology is a fat-tree.
End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, communication between nodes on the same leaf switch is faster than communication between nodes on different leaf switches (fewer hops, lower latency).
End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.
We want to schedule the tasks to nodes that are connected to the same leaf switch, or at least to nodes connected to the same spine switch. This reduces network latency and improves performance.
Design
We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.
Network Topology Information
The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.
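A minimal sketch of how that configmap could be laid out for a two-level fat-tree; the schema below is illustrative, not a settled format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: network-topology
  namespace: volcano-system
data:
  # Illustrative schema: which leaf switch each node hangs off,
  # and which spine switches each leaf uplinks to.
  topology.yaml: |
    leafSwitches:
      leaf-1:
        spines: [spine-1, spine-2]
        nodes: [node-001, node-002]
      leaf-2:
        spines: [spine-1, spine-2]
        nodes: [node-003, node-004]
```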
Similar to how Slurm picks topology-aware nodes, we can use certain tools to discover the live network topology. For example, we can use ibnetdiscover to discover the InfiniBand network topology.
We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.
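A rough sketch of what that controller loop could look like, assuming client-go informers; `runDiscoveryJob` is a hypothetical helper, and the `rdma/hca` resource name is only illustrative:

```go
// Package discovery sketches a controller that triggers topology discovery
// whenever an IB-capable node appears.
package discovery

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// WatchNodes wires up a node informer and kicks off discovery for nodes
// that expose an RDMA device resource.
func WatchNodes(clientset kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(clientset, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*v1.Node)
			// "rdma/hca" is illustrative; match whatever resource the
			// cluster's RDMA device plugin actually advertises.
			if _, ok := node.Status.Capacity["rdma/hca"]; ok {
				go runDiscoveryJob(clientset, node.Name)
			}
		},
	})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}

// runDiscoveryJob is a placeholder: it would create a batch/v1 Job pinned to
// nodeName that runs ibnetdiscover, parse the output, and update the
// topology ConfigMap.
func runDiscoveryJob(clientset kubernetes.Interface, nodeName string) {}
```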
Network Topology Aware Plugin
The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.
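As a sketch of the scoring idea (not the PoC's actual code), assuming leaf and spine-block membership maps derived from the topology configmap:

```go
package main

import "fmt"

// scoreNode favors candidates whose leaf switch already hosts tasks of the
// same job, then candidates in the same spine block, then everything else.
// leafOf maps node -> leaf switch; blockOf maps leaf switch -> spine block.
func scoreNode(leafOf, blockOf map[string]string, jobNodes []string, candidate string) float64 {
	for _, n := range jobNodes {
		if leafOf[n] == leafOf[candidate] {
			return 100 // same leaf: best locality
		}
	}
	for _, n := range jobNodes {
		if blockOf[leafOf[n]] == blockOf[leafOf[candidate]] {
			return 50 // same spine block: second best
		}
	}
	return 0
}

func main() {
	leafOf := map[string]string{"node-001": "leaf-1", "node-002": "leaf-1", "node-003": "leaf-2"}
	blockOf := map[string]string{"leaf-1": "block-a", "leaf-2": "block-a"}
	// node-002 shares a leaf with the job's existing task; node-003 only shares a spine block.
	fmt.Println(scoreNode(leafOf, blockOf, []string{"node-001"}, "node-002")) // 100
	fmt.Println(scoreNode(leafOf, blockOf, []string{"node-001"}, "node-003")) // 50
}
```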
Initial Code Change
Please see: https://github.com/yeahdongcn/volcano/tree/topo
This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.
If anyone is interested in this topic, we can have further discussion.
@shinytang6 for awareness.
When will this PR be merged?
The first task returning a score of 0 for all nodes might cause issues. I suggest comparing the remaining nodes in each topology to see if they meet the job's requirements and assign a corresponding score. This score should be proportional to the maximum number of tasks that can be accommodated.
Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?
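A minimal sketch of the capacity-proportional scoring suggested above, with a hypothetical helper that scores one topology domain from its free slots per node:

```go
package main

import "fmt"

// domainScore returns 0-100: a domain whose free capacity holds all of the
// job's remaining tasks scores highest; a partial fit scores proportionally.
func domainScore(freeSlotsPerNode map[string]int, remainingTasks int) float64 {
	if remainingTasks == 0 {
		return 0
	}
	free := 0
	for _, slots := range freeSlotsPerNode {
		free += slots
	}
	if free > remainingTasks {
		free = remainingTasks
	}
	return 100 * float64(free) / float64(remainingTasks)
}

func main() {
	// A leaf with 8 free slots fully fits an 8-task job; one with 4 fits half.
	fmt.Println(domainScore(map[string]int{"node-001": 4, "node-002": 4}, 8)) // 100
	fmt.Println(domainScore(map[string]int{"node-003": 4}, 8))                // 50
}
```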
Can you elaborate on the topology mapping algorithm?
Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?
Yes, the current one just tries its best to find a locally optimal solution and is a simple implementation. There will be an official release with a globally optimal solution.
It's a great feature and we ca
+1. If we implement it using a dynamic programming algorithm, it will be more efficient, but we may need to modify the allocate action framework, which would be a large change.
The best way is to use a dynamic programming algorithm, but that needs a new design, which may be beyond the scope of this issue. v1.10 is about to be released, so this feature can land in v1.11. Even if it's not the best approach, we can continue to optimize the algorithm :)