Avinash Maurya
Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in [1c701d7](https://github.com/deepspeedai/DeepSpeed/pull/7166/commits/1c701d7c61b170eea81dcc637379500a7586b9b2).
> > Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in [1c701d7](https://github.com/deepspeedai/DeepSpeed/pull/7166/commits/1c701d7c61b170eea81dcc637379500a7586b9b2).
>
> Thanks @mauryaavinash95 - The formatting checks look good, DCO shows failing. You...
Based on the checks, it now looks like only the DCO part is pending, @loadams. Please let me know if there's anything I can do to fix this quicker than...
> @mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?
>
> @saforem2, FYI

@tjruwase @saforem2 :...
@tjruwase I've added the `preserves_storage_sharing` function for the checkpointing engine, reverted the unwanted change in `deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py`, and uploaded a tutorial for using DataStates-LLM with DeepSpeed. Commit: 09858a7. Please let me...
@tjruwase: I have one more question about the way the `latest` checkpoint version is tracked in [DeepSpeed engine.py](https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L3333) and [Megatron-DeepSpeed engine](https://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/checkpointing.py#L317). Currently, both assume that checkpoints are synchronously flushed to stable...
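
To illustrate the concern about `latest`: if checkpoint data is flushed in the background, a `latest` file written at save time can point to a checkpoint that is not yet on stable storage. Below is a minimal sketch, not DeepSpeed's or DataStates-LLM's actual code, of deferring the `latest` update until the asynchronous flush has finished. The helper name `save_checkpoint_async`, the file name `model_states.bin`, and the use of a plain thread are all assumptions made for illustration.

```python
import os
import tempfile
import threading


def save_checkpoint_async(save_dir, tag, payload):
    """Hypothetical helper: flush `payload` in the background, then publish `latest`."""
    ckpt_dir = os.path.join(save_dir, tag)
    os.makedirs(ckpt_dir, exist_ok=True)

    def flush_then_publish():
        # 1. Write the checkpoint data and force it to stable storage.
        data_path = os.path.join(ckpt_dir, "model_states.bin")  # illustrative file name
        with open(data_path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # 2. Only now update `latest`, using a temporary file plus an atomic rename
        #    so that readers never observe a partially written tag.
        tmp_path = os.path.join(save_dir, "latest.tmp")
        with open(tmp_path, "w") as f:
            f.write(tag)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, os.path.join(save_dir, "latest"))

    worker = threading.Thread(target=flush_then_publish)
    worker.start()
    return worker  # the caller joins this before relying on `latest`


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    handle = save_checkpoint_async(demo_dir, "global_step10", b"example bytes")
    handle.join()
    with open(os.path.join(demo_dir, "latest")) as f:
        print(f.read())
```

With this ordering, a crash in the middle of a flush leaves `latest` pointing at the previous complete checkpoint instead of a partial one.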
@tjruwase Thanks for the feedback. I've updated the PR as per our discussion and moved the logic to debloat the tensors inside checkpointing engines. We can revisit the `bool decoupled()`...
> @mauryaavinash95, can you please look into the CI failures?
>
> Also, it seems we are unable to update the branch.

@tjruwase Thanks for letting me know. I'll resync...
> @mauryaavinash95 - is this ready to be merged?

@loadams: I think it is ready to be merged. The one pending thing we have is the `bool decoupled()` API for asynchronous...
> @mauryaavinash95 apologies for the delay on this. Since the FastPersist PR has been merged, do you want to resume this integration? Thanks!

@sfc-gh-truwase, sorry for the delay from our...