[REQUEST] Asynchronous Checkpointing

Open zaptrem opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

Checkpointing is significantly faster with Torch Distributed's async checkpoint feature: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict_saver.async_save

Blog post: https://pytorch.org/blog/reducing-checkpointing-times/

We want to checkpoint frequently, but it is expensive because it blocks training.

Describe the solution you'd like

Checkpoints should load the params to CPU, then save the checkpoint while training continues.
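
A minimal sketch of the requested behaviour, for illustration only. It is not DeepSpeed's or PyTorch's implementation, and the names `AsyncCheckpointer` and `snapshot_fn` are hypothetical. It assumes the training state is a plain picklable dict. With PyTorch tensors, the snapshot step would instead be a device-to-host copy of each tensor. Only the snapshot blocks the caller; the write to disk runs on a background thread.

```python
import copy
import pickle
import threading
from pathlib import Path


class AsyncCheckpointer:
    """Snapshot state synchronously, then write it to disk on a background thread."""

    def __init__(self, snapshot_fn=copy.deepcopy):
        # snapshot_fn must return a copy that training will not mutate.
        # Assumption: for PyTorch tensors this would be something like
        # {k: v.detach().cpu().clone() for k, v in state.items()}.
        self._snapshot_fn = snapshot_fn
        self._thread = None
        self._error = None

    def save(self, state, path):
        # Allow one in-flight write at a time so snapshots do not pile up in memory.
        self.wait()
        snapshot = self._snapshot_fn(state)  # the only step that blocks training
        self._thread = threading.Thread(
            target=self._write, args=(snapshot, Path(path))
        )
        self._thread.start()

    def _write(self, snapshot, path):
        try:
            tmp = path.with_name(path.name + ".tmp")
            with open(tmp, "wb") as f:
                pickle.dump(snapshot, f)
            # Rename over the final path (atomic on POSIX) so a reader
            # never sees a partially written checkpoint.
            tmp.replace(path)
        except Exception as exc:
            # Remember the failure; it is re-raised on the next wait().
            self._error = exc

    def wait(self):
        # Block until the pending write (if any) finishes, then surface errors.
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        if self._error is not None:
            err, self._error = self._error, None
            raise err
```

In a training loop, `save(state, path)` would be called at each checkpoint step and `wait()` once before exit. Because the state is copied before the thread starts, later in-place updates to `state` do not affect the file being written.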

Describe alternatives you've considered

Nebula is only for Azure users and is also kinda broken. Torch's FSDP appears to perform worse in general (performance and accuracy) compared to DeepSpeed (likely due to differences in your mixed precision implementations I don't quite understand yet).

zaptrem avatar Jul 02 '24 21:07 zaptrem

Will FastPersist be open-sourced in the next DeepSpeed release?

cailun01 avatar Jul 15 '24 14:07 cailun01

@zaptrem, thanks for this request. We currently lack bandwidth to add this feature, but it is noted.

tjruwase avatar Aug 03 '24 11:08 tjruwase

> Will FastPersist be open-sourced in the next DeepSpeed release?

@cailun01, yes, we plan to open-source soon.

tjruwase avatar Aug 03 '24 11:08 tjruwase

Hi @tjruwase
I went through the issue and am looking to contribute here, though I need some time for more clarification and understanding. I wanted to know if I can take this up as my first issue here, or if you have any suggestions, let me know :) Thanks!

Irene-123 avatar Aug 10 '24 18:08 Irene-123

@Irene-123, you are welcome to give it a try. But I suspect this requires non-trivial effort and is probably not a good first issue.

@zaptrem, are you able to provide guidance on this?

tjruwase avatar Sep 10 '24 23:09 tjruwase

> > Will FastPersist be open-sourced in the next DeepSpeed release?
>
> @cailun01, yes, we plan to open-source soon.

Hi @tjruwase, has FastPersist been open-sourced? May I ask where I could find the source code?

cailun01 avatar Dec 14 '24 04:12 cailun01