volcano
Add pod affinity group scheduling capability, to improve the performance of AI training jobs.
What would you like to be added:
- The affinityGroupSize field is added to the task definition of the Volcano job to describe the number of pods that require affinity scheduling. The following is an example:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
spec:
  schedulerName: volcano
  minAvailable: 4
  plugins:
    task-topology: ["--affinity", "[[ps-task, worker-task]]", "--anti-affinity", "[[ps-task]]"]
  tasks:
    - replicas: 16
      name: "ps-task"
      affinityGroupSize: 4  # Number of pods for affinity scheduling
      ....
- The label "volcano.sh/affinity-node-group" is added to the node to identify the nodes that are in the same physical chassis or have higher link performance.
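As a rough sketch of the node labeling described above: the proposal does not say what the label value should look like, so this assumes it is a free-form group identifier shared by all nodes in one chassis. The node names and the value `chassis-a` are hypothetical and only illustrate the idea.

```yaml
# Two nodes assumed to sit in the same physical chassis.
# Node names and the label value "chassis-a" are hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    volcano.sh/affinity-node-group: chassis-a
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
  labels:
    volcano.sh/affinity-node-group: chassis-a
```

Under this assumption, nodes with the same label value would form one node affinity group, and a group of `affinityGroupSize` pods would be placed within it.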
Why is this needed:
In AI training scenarios, the interconnect bandwidth between NPUs within a single physical chassis is higher than between chassis. When a large model is trained in a distributed fashion, aligning the distributed parallelism strategy with the node topology can greatly improve training performance. With this feature, users can specify the affinity group configuration a job requires when creating an AI job, so that the job can be allocated to a node affinity group within the same physical chassis.
/assign @liuyuanchun11
@liuyuanchun11 Does the task-topology plugin meet your requirement?
When performing distributed training on a large model, fully aligning a distributed parallel policy with a node topology structure can greatly improve performance of large model training. The scheduler allows users to specify the affinity group configuration required by the job when creating an AI job, so that the job can be allocated to the node affinity group in the same physical chassis
So your scenario requires placing the job in the same physical chassis? How about the network topology plugin in PR #3388? You can try it.