
[QUESTION] Status & Plan for Distributed Checkpointing in Megatron Repo

Open thuhujin opened this issue 1 year ago • 5 comments

Hi there, I noticed that distributed checkpointing was recently added to the repo under the megatron/core/dist_checkpointing directory. The current implementation looks like a good match for my use case. However, the code still seems to need some polishing, and it is not yet enabled in the production code path. I have a few questions about the current status and future plans:

  1. Could someone please provide an overall update on the status of the feature and when it might be ready?
  2. Will dist checkpointing be implemented for both Transformer Engine (TE) based models and non-TE based models? Currently it's only available for TE based models.
  3. I see tensorstore introduced in the changes. Will that be the primary or recommended approach in the Megatron repo for checkpoint storage? Some of Google's papers suggest that this is the approach Google uses.
  4. PyTorch offers similar functionality and has introduced concepts such as a load plan, sharded tensors, etc. How will the changes in the Megatron repo interact with PyTorch's?
  5. Finally, is there any roadmap in terms of checkpointing for Megatron? Having clearer plans would make it easier for contributors to align their efforts and make meaningful contributions.

Thanks!

thuhujin avatar Sep 14 '23 09:09 thuhujin

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Nov 13 '23 18:11 github-actions[bot]

Any update on this? Thanks!

thuhujin avatar Nov 14 '23 02:11 thuhujin

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jan 13 '24 18:01 github-actions[bot]

+1 for this issue, this seems to be an interesting and useful feature.

Besides, something like async checkpointing would also be useful. Will the Megatron team consider implementing it?

toothacher17 avatar Feb 18 '24 08:02 toothacher17

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Apr 18 '24 18:04 github-actions[bot]