Add pod affinity group scheduling capability, to improve the performance of AI training jobs.

Open liuyuanchun11 opened this issue 1 year ago • 3 comments

What would you like to be added:

  1. The affinityGroupSize field is added to the task definition of the Volcano job to describe the number of pods that require affinity scheduling. The following is an example:

         apiVersion: batch.volcano.sh/v1alpha1
         kind: Job
         metadata:
           name: test
         spec:
           schedulerName: volcano
           minAvailable: 4
           plugins:
             task-topology: ["--affinity", "[[ps-task, worker-task]]", "--anti-affinity", "[[ps-task]]"]
           tasks:
             - replicas: 16
               name: "ps-task"
               affinityGroupSize: 4  # Number of pods for affinity scheduling
               ....
  2. The label "volcano.sh/affinity-node-group" is added to the node to identify the nodes that are in the same physical chassis or have higher link performance.
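The two proposed pieces could work together roughly as sketched below. This is only an illustration of the idea, not Volcano's implementation: the function names, the node dictionaries, and the `free_slots` field are hypothetical, and the sketch assumes that pods are grouped by consecutive index and that each group must fit entirely within the nodes sharing one value of the proposed label.

```python
from collections import defaultdict

# Label name taken from the proposal; everything else here is hypothetical.
AFFINITY_LABEL = "volcano.sh/affinity-node-group"


def split_into_groups(replicas, group_size):
    """Split pod indices 0..replicas-1 into consecutive groups of group_size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [
        list(range(start, min(start + group_size, replicas)))
        for start in range(0, replicas, group_size)
    ]


def place_groups(groups, nodes):
    """Place each pod group onto nodes that share one affinity label value.

    nodes is a list of dicts: {"name": str, "labels": dict, "free_slots": int}.
    Returns {pod_index: node_name}, or None if some group does not fit.
    """
    # Bucket nodes by the value of the affinity label (one bucket per chassis).
    by_label = defaultdict(list)
    for node in nodes:
        value = node["labels"].get(AFFINITY_LABEL)
        if value is not None:
            by_label[value].append(node)

    free = {node["name"]: node["free_slots"] for node in nodes}
    placement = {}
    for pods in groups:
        # Pick the first node group with enough free capacity for the whole pod group.
        chosen = None
        for members in by_label.values():
            if sum(free[n["name"]] for n in members) >= len(pods):
                chosen = members
                break
        if chosen is None:
            return None  # all-or-nothing: no partial placement
        remaining = list(pods)
        for node in chosen:
            while remaining and free[node["name"]] > 0:
                placement[remaining.pop(0)] = node["name"]
                free[node["name"]] -= 1
    return placement
```

With `replicas: 16` and `affinityGroupSize: 4` as in the example job, `split_into_groups(16, 4)` yields four groups of four pods, and each group would need one node group with at least four free slots.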

Why is this needed:

In AI training scenarios, the interconnect bandwidth between NPUs within a single physical chassis is higher than between chassis. When performing distributed training of a large model, fully aligning the distributed parallelism strategy with the node topology can greatly improve training performance. The scheduler should allow users to specify the affinity group configuration a job requires when creating an AI job, so that the job's pods can be allocated to a node affinity group within the same physical chassis.
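To make the alignment argument concrete, the toy calculation below compares two placements of 16 pods across 4 chassis. It is a sketch under stated assumptions, not a measurement: it assumes that pods communicate mostly within groups of consecutive ranks and that every pair in a group counts equally. The function name and the placements are hypothetical.

```python
from itertools import combinations


def intra_chassis_fraction(chassis_of_rank, group_size):
    """Fraction of within-group pod pairs that land in the same chassis.

    chassis_of_rank[i] is the chassis id hosting pod rank i; groups are
    consecutive ranks of length group_size (an assumption of this sketch).
    """
    same = 0
    total = 0
    for start in range(0, len(chassis_of_rank), group_size):
        group = chassis_of_rank[start:start + group_size]
        for a, b in combinations(group, 2):
            total += 1
            if a == b:
                same += 1
    return same / total if total else 1.0


# Aligned: each group of 4 consecutive ranks shares one chassis.
aligned = [rank // 4 for rank in range(16)]
# Round-robin: consecutive ranks are spread across the 4 chassis.
round_robin = [rank % 4 for rank in range(16)]

print(intra_chassis_fraction(aligned, 4))      # 1.0
print(intra_chassis_fraction(round_robin, 4))  # 0.0
```

In the aligned case every within-group pair can use the faster in-chassis links, while in the round-robin case none can, which is the situation the proposed affinity group scheduling is meant to avoid.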

liuyuanchun11 · Feb 05 '24 03:02

/assign @liuyuanchun11

Monokaix · Mar 07 '24 09:03

@liuyuanchun11 Does the task-topology plugin meet your requirement?

lowang-bh · Apr 06 '24 01:04

> When performing distributed training on a large model, fully aligning a distributed parallel policy with a node topology structure can greatly improve performance of large model training. The scheduler allows users to specify the affinity group configuration required by the job when creating an AI job, so that the job can be allocated to the node affinity group in the same physical chassis

So your scenario requires placing the job within the same physical chassis? How about the network topology plugin in PR #3388? You could try it.

lowang-bh · Apr 06 '24 03:04