kubedl
[ASoC 2022] Implement native PyTorch elastic training based on the torch-elastic protocol.
Background:
As the official portal notes, torch-elastic has been upstreamed into PyTorch (>= 1.9). Because KubeDL manages the lifecycle of jobs and orchestrates their resources, it is important for it to implement the torch-elastic distributed training protocol and provide a fault-tolerant, elastic training experience. As a result, job completion time (JCT) can be significantly shortened while resources (CPU, memory, and GPUs) are better utilized.
Goals to be achieved:
- Design clean & user-friendly elastic training APIs and .
- Implement the elastic training control flow in the pytorch-controller.
- [Advanced] Design a scale-out/scale-in algorithm driven by user-customized metrics.
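
To make the advanced goal more concrete, here is a minimal sketch of what a metric-driven scaling decision could look like. It is only an illustration under stated assumptions, not a proposed design: `ElasticPolicy`, `next_replicas`, and the `min_gain` threshold are hypothetical names, and the metric is assumed to be a user-supplied "higher is better" value such as samples per second. The min/max bounds mirror the `MIN:MAX` node range that torch-elastic accepts via `torchrun --nnodes`.

```python
from dataclasses import dataclass


@dataclass
class ElasticPolicy:
    """Hypothetical policy object; field names are illustrative, not KubeDL's API."""
    min_replicas: int
    max_replicas: int
    # Minimum relative improvement of the metric required to keep scaling out.
    min_gain: float = 0.05


def next_replicas(policy: ElasticPolicy, current: int,
                  prev_metric: float, curr_metric: float) -> int:
    """Return the replica count to try next.

    prev_metric and curr_metric are a user-defined "higher is better" metric
    (assumed here to be something like samples per second), observed before
    and after the most recent scaling step.
    """
    if prev_metric <= 0:
        # No usable baseline yet: probe with one scale-out step.
        proposed = current * 2
    else:
        gain = (curr_metric - prev_metric) / prev_metric
        if gain >= policy.min_gain:
            # The last step paid off, so keep scaling out.
            proposed = current * 2
        elif gain < 0:
            # The metric regressed, so scale back in.
            proposed = current // 2
        else:
            # Marginal gain: hold the current size.
            proposed = current
    # Never leave the range the elastic job was declared with.
    return max(policy.min_replicas, min(policy.max_replicas, proposed))
```

A controller could call such a function once per observation window and patch the worker replica count only when the returned value differs from the current one. Doubling and halving is just one simple choice of step size; a real algorithm would likely also need cooldown periods and awareness of cluster capacity.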
Additional context:
This issue is part of our ASoC 2022 Program.
Difficulty: Normal
Mentor: Qiukai Chen (@SimonCqk)