
Support task-level network topology constraints

Open Monokaix opened this issue 9 months ago • 6 comments

What is the problem you're trying to solve

Currently, vcjob and PodGroup support setting a workload-level network topology constraint. We should also support setting network topology constraints at the task level.

Describe the solution you'd like

In both training and inference scenarios, we don't necessarily need all tasks within the entire job to be restricted to the same HyperNode.

  1. In training scenarios that combine pipeline parallelism (PP) and data parallelism (DP), it is sufficient for the data-parallel (DP) tasks to be placed within one topology domain.

  2. In inference scenarios, such as vLLM, usually only the workers need to be deployed within the same topology domain, while the leader has no topology constraints.
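
To make the intended semantics concrete, here is a minimal Go sketch of a task-level check. It is not Volcano code: the `Pod` type, its fields, and `SatisfiesTaskLevelConstraint` are hypothetical names invented for illustration, and a HyperNode is reduced to a plain string rather than a tiered hierarchy. The point is only that pods of a constrained task (for example, the workers) must share one topology domain, while pods of other tasks (for example, the leader) may land anywhere.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified placement record:
// which task the pod belongs to and which HyperNode it landed in.
type Pod struct {
	Name      string
	Task      string
	HyperNode string
}

// SatisfiesTaskLevelConstraint reports whether, for every task marked as
// constrained, all of that task's pods share a single HyperNode.
// Pods of unconstrained tasks are ignored.
func SatisfiesTaskLevelConstraint(pods []Pod, constrained map[string]bool) bool {
	domain := make(map[string]string) // task name -> HyperNode seen first
	for _, p := range pods {
		if !constrained[p.Task] {
			continue // e.g. the leader: no topology requirement
		}
		if d, seen := domain[p.Task]; seen && d != p.HyperNode {
			return false // two pods of one constrained task in different domains
		}
		domain[p.Task] = p.HyperNode
	}
	return true
}

func main() {
	pods := []Pod{
		{Name: "leader-0", Task: "leader", HyperNode: "hn-b"},
		{Name: "worker-0", Task: "worker", HyperNode: "hn-a"},
		{Name: "worker-1", Task: "worker", HyperNode: "hn-a"},
	}
	// Only the worker task is constrained, so the leader's placement does not matter.
	fmt.Println(SatisfiesTaskLevelConstraint(pods, map[string]bool{"worker": true}))
}
```

Under a job-level constraint, the same placement would be rejected because the leader sits in a different domain than the workers; that difference is what this issue asks for.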

Additional context

  1. The Volcano scheduler should be modified to support task-level topology.
  2. A webhook should be added for validation, e.g., the task-level HighestTierAllowed must not be greater than the job-level HighestTierAllowed.
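
The validation rule in item 2 could be sketched roughly as follows. This is an illustrative Go sketch, not the actual webhook: the `NetworkTopology`, `Task`, and `Job` types and the `ValidateTaskTiers` function are hypothetical simplifications, and it assumes that a nil topology means "no constraint set" at that level.

```go
package main

import "fmt"

// NetworkTopology is a hypothetical, simplified topology constraint.
type NetworkTopology struct {
	HighestTierAllowed int
}

// Task and Job are simplified stand-ins for a job spec; a nil Topology
// is assumed to mean that no constraint is set at that level.
type Task struct {
	Name     string
	Topology *NetworkTopology
}

type Job struct {
	Topology *NetworkTopology
	Tasks    []Task
}

// ValidateTaskTiers returns an error for the first task whose
// HighestTierAllowed is greater than the job-level HighestTierAllowed.
func ValidateTaskTiers(job Job) error {
	if job.Topology == nil {
		return nil // nothing at job level to compare against
	}
	for _, t := range job.Tasks {
		if t.Topology == nil {
			continue // task has no constraint of its own
		}
		if t.Topology.HighestTierAllowed > job.Topology.HighestTierAllowed {
			return fmt.Errorf(
				"task %q: HighestTierAllowed %d is greater than job-level HighestTierAllowed %d",
				t.Name, t.Topology.HighestTierAllowed, job.Topology.HighestTierAllowed)
		}
	}
	return nil
}

func main() {
	job := Job{
		Topology: &NetworkTopology{HighestTierAllowed: 2},
		Tasks: []Task{
			{Name: "leader"}, // unconstrained
			{Name: "worker", Topology: &NetworkTopology{HighestTierAllowed: 3}},
		},
	}
	if err := ValidateTaskTiers(job); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A real admission webhook would run a check like this when the job is created or updated and reject the request on error.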

Monokaix avatar Apr 15 '25 06:04 Monokaix

I think this scenario is necessary and I hope to contribute to this feature /assign

hwdef avatar Apr 15 '25 06:04 hwdef

/unassign

Due to personal reasons, I will no longer invest in the development of this feature

hwdef avatar Aug 08 '25 02:08 hwdef

/assign

ouyangshengjia avatar Oct 14 '25 03:10 ouyangshengjia

This feature will be implemented in v1.14

JesseStutler avatar Oct 16 '25 08:10 JesseStutler

Task #4721 is still outstanding, so I am reopening this issue.

JesseStutler avatar Nov 18 '25 01:11 JesseStutler

Thanks to @ouyangshengjia for the great contribution in #4721, which addresses this issue. The following TODO tasks remain:

  • [ ] Add e2e test cases for subgroup
  • [ ] Support minSubGroups
  • [ ] Support highestAllowedTierName

JesseStutler avatar Nov 29 '25 01:11 JesseStutler

All the tasks on the TODO list have been resolved and merged.

ouyangshengjia avatar Dec 18 '25 07:12 ouyangshengjia