Add DataStates-LLM: Asynchronous Checkpointing Engine Support
We are a team at Argonne National Laboratory working on low-overhead asynchronous checkpointing approaches for LLMs and transformers. As part of these efforts, we have developed DataStates-LLM, a library that we would like to contribute to the DeepSpeed community: https://github.com/datastates/datastates-llm
The key idea we leverage is to allow non-blocking tensor copies from the GPU to the host during the forward and backward passes. We block only if these copies have not finished by the time the update phase begins. Meanwhile, the tensors are flushed asynchronously from host memory to durable storage (parallel file systems, local SSDs, etc.).
To enable this capability, our initial implementation makes the scheduler aware of checkpointing, calling a ckpt.wait() primitive before starting the update phase. We illustrated this with the pipeline scheduler. We are also considering a scheduler-independent solution that integrates with DeepSpeed/Megatron and provides a hook for the start of the update phase, which we can leverage to run ckpt.wait().
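The overlap described above can be sketched with plain Python threads standing in for CUDA-stream device-to-host copies. This is an illustrative model only; the class and method names below are hypothetical and are not the DataStates-LLM API:

```python
import threading

class AsyncCheckpoint:
    """Sketch: snapshot tensors without blocking training, then flush them in
    the background. Training blocks in wait() only if the device-to-host copy
    has not finished by the start of the update phase."""

    def __init__(self):
        self._copy_done = threading.Event()
        self._flush_thread = None

    def snapshot(self, tensors, flush_fn):
        self._copy_done.clear()

        def _worker():
            # Stand-in for the non-blocking GPU-to-host copy.
            host_copies = [t.copy() for t in tensors]
            self._copy_done.set()        # update phase may now proceed
            flush_fn(host_copies)        # async flush to durable storage

        self._flush_thread = threading.Thread(target=_worker)
        self._flush_thread.start()

    def wait(self):
        # Called by the scheduler before the update phase begins.
        self._copy_done.wait()

# Usage: snapshot during forward/backward, wait before the optimizer update.
written = []
ckpt = AsyncCheckpoint()
ckpt.snapshot([[1.0, 2.0]], flush_fn=written.append)
ckpt.wait()  # safe to mutate the original tensors from here on
```

The scheduler-independent variant mentioned above would simply call `wait()` from a hook fired at the start of the update phase instead of from the pipeline scheduler.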
We appreciate your feedback and look forward to a collaboration in this space.
Hi @mauryaavinash95 - could you please run the pre-commit formatter? That should fix the formatting errors at least.
Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.
Thanks @mauryaavinash95 - The formatting checks look good, DCO shows failing. You can rebase to fix with the command here or if that might cause issues given the complex git history here, we can manually approve the DCO check if you let us know.
I tried following the DCO instructions, and this is what I see in my git log:
commit 1c701d7c61b170eea81dcc637379500a7586b9b2 (HEAD -> dev, origin/dev)
Author: Avinash <[email protected]>
Date:   Mon Mar 24 14:45:11 2025 -0500

    Fix formatting issues for DataStates-LLM

    Signed-off-by: Avinash Maurya <[email protected]>
And I think it would be very helpful if you can manually approve the DCO using my email as [email protected].
Based on the checks now it looks like only the DCO part is pending @loadams. Please let me know if there's anything I can do to fix this quicker than the DeepSpeed team manually approving the DCO.
@mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?
@saforem2, FYI
@tjruwase @saforem2 : yes, we'd like to set up a tutorial for this. Currently, there is just a short snippet to enable it in deepspeed/runtime/checkpoint_engine/README.md. Could you please point us to a reference and repository that we can use for the tutorial?
@mauryaavinash95, DeepSpeed tutorials appear on deepspeed.ai:
- Listed here: https://www.deepspeed.ai/tutorials/
- The docs are edited here: https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials
@tjruwase I've added the preserves_storage_sharing function for the checkpointing engine; fixed the unwanted commit in deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py; and uploaded a tutorial for using DataStates-LLM with DeepSpeed. Commit: 09858a7. Please let me know what you think.
@tjruwase: I have one more question about the way the latest checkpoint version is tracked in DeepSpeed engine.py and the Megatron-DeepSpeed engine.
Currently, both assume that checkpoints are synchronously flushed to stable storage by the time the function returns, and they immediately update the tracking files for the latest version. However, this assumption doesn't hold for asynchronous checkpointing, where flushes to slower tiers may still be in progress after the function exits.
Do you have thoughts on how best to handle this? One idea could be to move this responsibility into the checkpointing engine itself, allowing it to manage the timing and semantics of when the latest marker is updated.
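To make the idea concrete, here is a minimal sketch of deferring the `latest` marker until the flush completes. The helper name is hypothetical and is not a DeepSpeed API; it only illustrates the ordering constraint:

```python
import os
import tempfile
import threading
import time

def write_latest_after_flush(flush_thread, ckpt_dir, tag):
    """Hypothetical helper: update the `latest` tracking file only after the
    asynchronous flush has completed, so the marker never points at a
    checkpoint that is still in flight."""
    def _commit():
        flush_thread.join()  # wait until checkpoint data reaches storage
        with open(os.path.join(ckpt_dir, "latest"), "w") as f:
            f.write(tag)
    committer = threading.Thread(target=_commit)
    committer.start()
    return committer

# Demo: a slow stand-in "flush" followed by the deferred latest update.
ckpt_dir = tempfile.mkdtemp()
flush = threading.Thread(target=time.sleep, args=(0.1,))
flush.start()
committer = write_latest_after_flush(flush, ckpt_dir, "global_step100")
committer.join()
```

Moving this responsibility into the checkpointing engine would let each engine choose when (and from which thread) the marker write happens.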
@mauryaavinash95, good question. We handle this in our upcoming FastPersist code release. The idea is to add a bool decoupled() API to the checkpoint engine, where decoupled means asynchronous. For decoupled engines, the logic that commits checkpoints, including writing latest, runs in engine.step() before the optimizer step is called. Coupled engines use the existing logic. If you are not blocked on this, we can revisit sometime next week, when our PR is available, to align the APIs.
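A minimal sketch of the coupled/decoupled split described above. All names here are illustrative assumptions, not the actual FastPersist or DeepSpeed API:

```python
class CheckpointEngine:
    """Illustrative base class for the coupled/decoupled distinction."""

    def decoupled(self) -> bool:
        return False  # coupled engines commit synchronously inside save()

class DecoupledCheckpointEngine(CheckpointEngine):
    def __init__(self):
        self.pending = []    # tags of checkpoints still being flushed
        self.committed = []

    def decoupled(self) -> bool:
        return True

    def save(self, state, tag):
        # Flushing would proceed asynchronously; only the tag is recorded here.
        self.pending.append(tag)

    def commit(self):
        # Invoked from engine.step() before the optimizer step: finalize
        # in-flight checkpoints, then write the `latest` marker.
        self.committed.extend(self.pending)
        self.pending = []

def training_step(ckpt_engine, optimizer_step):
    # Decoupled engines commit earlier checkpoints before the weights change.
    if ckpt_engine.decoupled():
        ckpt_engine.commit()
    optimizer_step()

engine = DecoupledCheckpointEngine()
engine.save(state={}, tag="global_step50")
training_step(engine, optimizer_step=lambda: None)
```

Committing before the optimizer step guarantees the `latest` marker is only advanced once the checkpointed weights can no longer be overwritten in place.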
@tjruwase Thanks for the feedback. I've updated the PR per our discussion and moved the tensor-debloating logic into the checkpointing engines.
We can revisit the bool decoupled() API next week once the FastPersist engine PR is in place.
@mauryaavinash95, can you please look into the CI failures?
Also, it seems we are unable to update the branch.
@tjruwase Thanks for letting me know. I'll resync with the latest master branch and update the PR within a week; hopefully that resolves the CI failures. We don't yet have any DataStates-LLM-specific unit tests, so the checkpointing engine shouldn't cause failures in any other tests, right?
@mauryaavinash95 - is this ready to be merged?
@loadams: I think it is ready to be merged. The one pending thing we have is bool decoupled() API for asynchronous commit, which @tjruwase said we can discuss when the FastPersist engine PR is in place.
@mauryaavinash95 apologies for the delay on this. Since the FastPersist PR has been merged, do you want to resume this integration? Thanks!
@sfc-gh-truwase, sorry for the delay on our end too; I didn't get notified of this message. We've pushed an updated version based on the decoupled checkpointing engine. It also includes the changes to the preserves_storage_sharing setting that we discussed previously.
Hi @sfc-gh-truwase, just wanted to check if you had any feedback on this PR.
@mauryaavinash95 I will take a close look next week! Thanks for the reminder.
@mauryaavinash95, I approved with a suggested change.
@sfc-gh-truwase, thanks for the update. I've made a minor change in exporting the datastates engine, and it looks good from our end.
@mauryaavinash95 can you rebase the branch? For some reason, I am unable to do that.
Hi @sfc-gh-truwase, I've rebased it; can you please check again? Thanks.