volcano icon indicating copy to clipboard operation
volcano copied to clipboard

Network Topology Aware Plugin

Open lowang-bh opened this issue 10 months ago • 15 comments

/kind feature fixes #2984 fixes #447 fixes #3317 There are several issues request this feature, such as #447 #2984 #3317

Motivation

We target to make scheduler net-topology aware so as to achieve the following:

  • best effort to schedule same job to same topology devices, such as same idc.

Goals

  • Support single key topology configuration, try to schedule job's all tasks to nodes which have same value with that key
  • Support multiple-key topology policies, the key at front get higher score

Non-Goals

  • Not to find the global solutions among nodes with all kind values of that key

lowang-bh avatar Apr 06 '24 00:04 lowang-bh

/assign @Monokaix @hwdef @william-wang

lowang-bh avatar Apr 07 '24 06:04 lowang-bh

we should talk this in weekly meeting.

hwdef avatar Apr 08 '24 06:04 hwdef

Sounds interesting!But it's maybe more complex than the given desgin. For example, network delay varies between different nodes. Also, it varies in different period to the same node. Maybe considering with network performance metrics for this feature will be a good choice. I think we should take a discussion in the community and complete the design first.

Thor-wl avatar Apr 11 '24 02:04 Thor-wl

What's the difference between this and https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

Monokaix avatar Apr 11 '24 02:04 Monokaix

Maybe considering with network performance metrics for this feature will be a good choice.

I would like to recommend to treat the network performace metrics as a kind of load, so it's more like a loadaware scheduling.

network delay varies between different nodes

Now it is not considered. This plugin just conserder the physical difference in topology.

lowang-bh avatar Apr 11 '24 06:04 lowang-bh

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

I think it is a lite one of that plugin, just consider several physical topology, such as idc, rock, switch, and depend on those labels on nodes.

advantage: more simply to use, just rely on node labels shortcoming:no bandwidth, no latency, etc.

lowang-bh avatar Apr 11 '24 06:04 lowang-bh

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

I think it is a lite one of that plugin, just consider several physical topology, such as idc, rock, switch, and depend on those labels on nodes.

advantage: more simply to use, just rely on node labels shortcoming:no bandwidth, no latency, etc.

We should collect more user cases: )

Monokaix avatar Apr 11 '24 08:04 Monokaix

We have some thoughts on network topology aware scheduling as well, BUT limited in the same IDC.

User Story

Saying we have a 10,000 GPU cluster and each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches and there might be multiple paths between leaf switches. The network topology is a fat-tree.

End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, the communication between nodes in the same leaf switch is faster than the communication between nodes in different leaf switches (fewer jumps, less latency).

End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.

We want to schedule the tasks to the nodes that are connected to the same leaf switch, or at least to the nodes that are connected to the same spine switch. This can reduce the network latency and improve the performance.

Design

We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.

Network Topology Information

The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.

Similar to how to pick the topology-aware nodes in Slurm, we can use certain tools to discover the network topology alive. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.

Network Topology Aware Plugin

The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.

Initial Code Change

Please see: https://github.com/yeahdongcn/volcano/tree/topo

This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.

If anyone is interested in this topic, we can have further discussion.

@shinytang6 for awareness.

yeahdongcn avatar Jun 05 '24 07:06 yeahdongcn

When will this PR be merged?

wangyizhi1 avatar Jul 24 '24 12:07 wangyizhi1

The first task returning a score of 0 for all nodes might cause issues. I suggest comparing the remaining nodes in each topology to see if they meet the job's requirements and assign a corresponding score. This score should be proportional to the maximum number of tasks that can be accommodated.

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

wangyizhi1 avatar Jul 29 '24 02:07 wangyizhi1

We have some thoughts on network topology aware scheduling as well, BUT limited in the same IDC.

User Story

Saying we have a 10,000 GPU cluster and each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches and there might be multiple paths between leaf switches. The network topology is a fat-tree.

End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, the communication between nodes in the same leaf switch is faster than the communication between nodes in different leaf switches (fewer jumps, less latency).

End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.

We want to schedule the tasks to the nodes that are connected to the same leaf switch, or at least to the nodes that are connected to the same spine switch. This can reduce the network latency and improve the performance.

Design

We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.

Network Topology Information

The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.

Similar to how to pick the topology-aware nodes in Slurm, we can use certain tools to discover the network topology alive. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.

Network Topology Aware Plugin

The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.

Initial Code Change

Please see: https://github.com/yeahdongcn/volcano/tree/topo

This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.

If anyone is interested in this topic, we can have further discussion.

@shinytang6 for awareness.

can you elaborate the topology mapping algorithm?

elinx avatar Jul 30 '24 03:07 elinx

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, current one is just try the best to find a locally optimal solution and is a simple realization. There will be an official release for a global optimal solution.

lowang-bh avatar Jul 30 '24 04:07 lowang-bh

It's a great feature and we ca

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, current one is just try the best to find a locally optimal solution and is a simple realization. There will be and official release for a global optimal solution.

+1 If we implement it using dynamic programming algorithm, it will be more efficient, but maybe we need modify the allocate action framework, which changes a lot.

Monokaix avatar Aug 15 '24 08:08 Monokaix

The best way is to use dynamic programming algorithm, but this needs a new design, which may beyond the scope of the issue, v.1.10 is about to release, so this feature can be released v1.11, even it's not a best way, but we can continue to optimize the algorithm: )

Monokaix avatar Aug 15 '24 08:08 Monokaix

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please assign hwdef You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot avatar Sep 07 '24 03:09 volcano-sh-bot