Support task level network topology constraint
What is the problem you're trying to solve
Currently vcjob and PodGroup support setting a workload level network topology constraint; we should also support a task level network topology setting.
Describe the solution you'd like
In both training and inference scenarios, we don't necessarily need all tasks within the entire job to be restricted to the same HyperNode.
- In training scenarios, with pipeline parallelism (pp) and data parallelism (dp), it's sufficient for the data parallel (dp) tasks to be distributed within one topology domain.
- In inference scenarios, such as vllm, the requirement is usually only for the workers to be deployed within the same topology domain, while the leader has no topology constraints.
Additional context
- Should modify Volcano scheduler to support task level topology.
- Should add webhook to validate, e.g., task level HighestTierAllowed should not be greater than job level HighestTierAllowed.
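A minimal sketch of the validation rule from the last bullet, assuming simplified stand-in types. `NetworkTopology`, `Task`, `Job`, and `ValidateTaskTiers` are hypothetical names used only for illustration; they are not the actual Volcano API or webhook code, and the real check may differ.

```go
package main

import "fmt"

// NetworkTopology is a hypothetical, simplified stand-in for a topology
// constraint. It is NOT the real Volcano API type.
type NetworkTopology struct {
	HighestTierAllowed int
}

// Task is a hypothetical task spec. A nil Topology means the task carries
// no task level constraint (for example, an inference leader).
type Task struct {
	Name     string
	Topology *NetworkTopology
}

// Job is a hypothetical job spec with an optional job level constraint.
type Job struct {
	Topology *NetworkTopology
	Tasks    []Task
}

// ValidateTaskTiers rejects a job when any task level HighestTierAllowed
// is greater than the job level HighestTierAllowed.
func ValidateTaskTiers(job Job) error {
	if job.Topology == nil {
		// No job level constraint, so there is nothing to compare against.
		return nil
	}
	for _, t := range job.Tasks {
		if t.Topology == nil {
			continue // unconstrained task
		}
		if t.Topology.HighestTierAllowed > job.Topology.HighestTierAllowed {
			return fmt.Errorf(
				"task %q: HighestTierAllowed %d is greater than job level HighestTierAllowed %d",
				t.Name, t.Topology.HighestTierAllowed, job.Topology.HighestTierAllowed)
		}
	}
	return nil
}

func main() {
	job := Job{
		Topology: &NetworkTopology{HighestTierAllowed: 2},
		Tasks: []Task{
			{Name: "leader"}, // no task level constraint
			{Name: "worker", Topology: &NetworkTopology{HighestTierAllowed: 3}},
		},
	}
	// The worker tier exceeds the job tier, so an error is reported.
	if err := ValidateTaskTiers(job); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

In this sketch a job without a job level constraint is always accepted; whether the real webhook should behave that way is an open design question.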
I think this scenario is necessary, and I hope to contribute to this feature.
/assign
/unassign
Due to personal reasons, I will no longer invest in the development of this feature
/assign
This feature will be implemented in v1.14
#4721 is still pending, so reopening.
Thanks to @ouyangshengjia for the great contribution in #4721 toward solving this issue. We still have these TODO tasks:
- [ ] Add e2e test cases for subgroup
- [ ] Support `minSubGroups`
- [ ] Support `highestAllowedTierName`
All the tasks on the TODO list have been resolved and merged.