kubedl

[ASoC 2022] Implement native PyTorch elastic training based on the torch-elastic protocol.

Open SimonCqk opened this issue 2 years ago • 0 comments

Background:

As the official portal announced, torch-elastic has been upstreamed into PyTorch as of version 1.9. KubeDL manages the lifecycle of jobs and orchestrates their resources, so it is critical for it to implement the torch-elastic distributed training protocol and offer a fault-tolerant, elastic training experience. With that in place, job completion time (JCT) can be significantly shortened and resources (CPU, memory, and GPUs) can be better utilized.
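For context, here is a minimal sketch of what an elastic launch looks like from the worker side. It assumes the controller would hand each worker a min/max node range and a rendezvous endpoint. The `ElasticSpec` type and `build_launch_args` helper are hypothetical names used only for illustration and are not part of KubeDL. The flags themselves are the ones accepted by `torch.distributed.run` in PyTorch >= 1.9.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ElasticSpec:
    """Hypothetical description of an elastic job (not a KubeDL API)."""
    min_nodes: int
    max_nodes: int
    nproc_per_node: int
    rdzv_endpoint: str   # host:port of the rendezvous backend
    job_id: str          # shared by all workers of the same job
    max_restarts: int = 3


def build_launch_args(spec: ElasticSpec, script: str,
                      script_args: Sequence[str] = ()) -> List[str]:
    """Build the command line for an elastic worker.

    Uses `python -m torch.distributed.run`, which is available since
    PyTorch 1.9 (the `torchrun` alias was added in a later release).
    """
    if not 1 <= spec.min_nodes <= spec.max_nodes:
        raise ValueError("expected 1 <= min_nodes <= max_nodes")
    return [
        "python", "-m", "torch.distributed.run",
        # MIN:MAX makes the job elastic: it can run with any node count in range.
        f"--nnodes={spec.min_nodes}:{spec.max_nodes}",
        f"--nproc_per_node={spec.nproc_per_node}",
        # Worker groups are restarted on failure or membership change.
        f"--max_restarts={spec.max_restarts}",
        f"--rdzv_id={spec.job_id}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={spec.rdzv_endpoint}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    spec = ElasticSpec(min_nodes=1, max_nodes=4, nproc_per_node=2,
                       rdzv_endpoint="master-0:29400", job_id="demo-job")
    print(" ".join(build_launch_args(spec, "train.py", ["--epochs", "10"])))
```

The endpoint `master-0:29400` and the script name are placeholders. How the rendezvous endpoint is actually exposed to pods is one of the design questions this issue leaves open.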

Goals to be achieved:

  • Design clean & user-friendly elastic training APIs and .
  • Implement elastic training control flow on pytorch-controller.
  • [Advanced] Design a scale-out/scale-in algorithm based on user-customized metrics.
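As a starting point for discussing the [Advanced] goal, the sketch below shows one possible scaling rule. It is not a proposed design: it borrows the proportional formula used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current * metric / target)) and clamps the result to the job's min/max range. The function name, the parameters, and the tolerance default are all assumptions made for illustration.

```python
import math


def desired_workers(current: int, metric: float, target: float,
                    min_workers: int, max_workers: int,
                    tolerance: float = 0.1) -> int:
    """Return the worker count suggested by a proportional scaling rule.

    `metric` is the observed value of a user-defined metric and `target`
    is the value the user wants to hold it at (both hypothetical inputs).
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")

    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size to avoid flapping.
        desired = current
    else:
        desired = math.ceil(current * ratio)

    # Never leave the elastic range the job was launched with.
    return max(min_workers, min(max_workers, desired))


if __name__ == "__main__":
    # Metric is 1.5x the target, so 4 workers become 6.
    print(desired_workers(4, metric=90, target=60, min_workers=1, max_workers=8))
```

Because every membership change in torch-elastic triggers a new rendezvous and a restart of the worker group, a real algorithm would likely also need a cooldown between scaling decisions. That is left out here to keep the sketch short.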

Additional context:

This issue is part of our ASoC 2022 Program.

Difficulty: Normal
Mentor: Qiukai Chen (@SimonCqk)

SimonCqk, May 30 '22 03:05