volcano Network Topology Aware Plugin

/kind feature fixes #2984 fixes #447 fixes #3317 There are several issues request this feature, such as #447 #2984 #3317

Motivation

We target to make scheduler net-topology aware so as to achieve the following:

best effort to schedule same job to same topology devices, such as same idc.

Goals

Support single key topology configuration, try to schedule job's all tasks to nodes which have same value with that key
Support multiple-key topology policies, the key at front get higher score

Non-Goals

Not to find the global solutions among nodes with all kind values of that key

Apr 06 '24 00:04 lowang-bh

/assign @Monokaix @hwdef @william-wang

Apr 07 '24 06:04 lowang-bh

we should talk this in weekly meeting.

Apr 08 '24 06:04 hwdef

Sounds interesting！But it's maybe more complex than the given desgin. For example, network delay varies between different nodes. Also, it varies in different period to the same node. Maybe considering with network performance metrics for this feature will be a good choice. I think we should take a discussion in the community and complete the design first.

Apr 11 '24 02:04 Thor-wl

What's the difference between this and https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware？

Apr 11 '24 02:04 Monokaix

Maybe considering with network performance metrics for this feature will be a good choice.

I would like to recommend to treat the network performace metrics as a kind of load, so it's more like a loadaware scheduling.

network delay varies between different nodes

Now it is not considered. This plugin just conserder the physical difference in topology.

Apr 11 '24 06:04 lowang-bh

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware？

I think it is a lite one of that plugin, just consider several physical topology, such as idc, rock, switch, and depend on those labels on nodes.

advantage: more simply to use, just rely on node labels shortcoming：no bandwidth, no latency, etc.

Apr 11 '24 06:04 lowang-bh

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware？

I think it is a lite one of that plugin, just consider several physical topology, such as idc, rock, switch, and depend on those labels on nodes.

advantage: more simply to use, just rely on node labels shortcoming：no bandwidth, no latency, etc.

We should collect more user cases: )

Apr 11 '24 08:04 Monokaix

We have some thoughts on network topology aware scheduling as well, BUT limited in the same IDC.

User Story

Saying we have a 10,000 GPU cluster and each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches and there might be multiple paths between leaf switches. The network topology is a fat-tree.

End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, the communication between nodes in the same leaf switch is faster than the communication between nodes in different leaf switches (fewer jumps, less latency).

End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.

We want to schedule the tasks to the nodes that are connected to the same leaf switch, or at least to the nodes that are connected to the same spine switch. This can reduce the network latency and improve the performance.

Design

We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.

Network Topology Information

The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.

Similar to how to pick the topology-aware nodes in Slurm, we can use certain tools to discover the network topology alive. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.

Network Topology Aware Plugin

The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.

Initial Code Change

Please see: https://github.com/yeahdongcn/volcano/tree/topo

This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.

If anyone is interested in this topic, we can have further discussion.

@shinytang6 for awareness.

Jun 05 '24 07:06 yeahdongcn

When will this PR be merged?

Jul 24 '24 12:07 wangyizhi1

The first task returning a score of 0 for all nodes might cause issues. I suggest comparing the remaining nodes in each topology to see if they meet the job's requirements and assign a corresponding score. This score should be proportional to the maximum number of tasks that can be accommodated.

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Jul 29 '24 02:07 wangyizhi1

We have some thoughts on network topology aware scheduling as well, BUT limited in the same IDC.

User Story

Saying we have a 10,000 GPU cluster and each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches and there might be multiple paths between leaf switches. The network topology is a fat-tree.

End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, the communication between nodes in the same leaf switch is faster than the communication between nodes in different leaf switches (fewer jumps, less latency).

End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.

We want to schedule the tasks to the nodes that are connected to the same leaf switch, or at least to the nodes that are connected to the same spine switch. This can reduce the network latency and improve the performance.

Design

We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.

Network Topology Information

The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.

Similar to how to pick the topology-aware nodes in Slurm, we can use certain tools to discover the network topology alive. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.

Network Topology Aware Plugin

The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.

Initial Code Change

Please see: https://github.com/yeahdongcn/volcano/tree/topo

This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.

If anyone is interested in this topic, we can have further discussion.

@shinytang6 for awareness.

can you elaborate the topology mapping algorithm？

Jul 30 '24 03:07 elinx

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, current one is just try the best to find a locally optimal solution and is a simple realization. There will be an official release for a global optimal solution.

Jul 30 '24 04:07 lowang-bh

It's a great feature and we ca

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, current one is just try the best to find a locally optimal solution and is a simple realization. There will be and official release for a global optimal solution.

+1 If we implement it using dynamic programming algorithm, it will be more efficient, but maybe we need modify the allocate action framework, which changes a lot.

Aug 15 '24 08:08 Monokaix

The best way is to use dynamic programming algorithm, but this needs a new design, which may beyond the scope of the issue, v.1.10 is about to release, so this feature can be released v1.11, even it's not a best way, but we can continue to optimize the algorithm: )

Aug 15 '24 08:08 Monokaix

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please assign hwdef You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Sep 07 '24 03:09 volcano-sh-bot

volcano volcano copied to clipboard

Network Topology Aware Plugin

Motivation

Goals

Non-Goals

User Story

Design

Network Topology Information

Network Topology Aware Plugin

Initial Code Change

User Story

Design

Network Topology Information

Network Topology Aware Plugin

Initial Code Change

volcano
volcano copied to clipboard