Add DataStates-LLM: Asynchronous Checkpointing Engine Support
We are a team at Argonne National Laboratory working on low-overhead asynchronous checkpointing approaches for LLMs and transformers. As part of these efforts, we have developed DataStates-LLM, a library that we would like to contribute to the DeepSpeed community: https://github.com/datastates/datastates-llm
The key idea we leverage is to allow non-blocking tensor copies from the GPU to the host during the forward and backward passes. We block only if these copies have not finished by the time the update phase begins. Meanwhile, the tensors are flushed asynchronously from host memory to durable storage (parallel file systems, local SSDs, etc.).
To enable this capability, our initial implementation makes the scheduler aware of checkpointing, calling a ckpt.wait() primitive before starting the update phase. We illustrated this with the pipeline scheduler. We are also considering a scheduler-independent solution that integrates with DeepSpeed/Megatron and provides a hook for the start of the update phase, which we can leverage to run ckpt.wait().
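The overlap described above can be sketched with plain Python threads standing in for CUDA-stream device-to-host copies. This is an illustrative model only; the class and method names below are hypothetical and are not the DataStates-LLM API:

```python
import threading

class AsyncCheckpoint:
    """Sketch: snapshot tensors without blocking training, then flush them in
    the background. Training blocks in wait() only if the device-to-host copy
    has not finished by the start of the update phase."""

    def __init__(self):
        self._copy_done = threading.Event()
        self._flush_thread = None

    def snapshot(self, tensors, flush_fn):
        self._copy_done.clear()

        def _worker():
            # Stand-in for the non-blocking GPU-to-host copy.
            host_copies = [t.copy() for t in tensors]
            self._copy_done.set()        # update phase may now proceed
            flush_fn(host_copies)        # async flush to durable storage

        self._flush_thread = threading.Thread(target=_worker)
        self._flush_thread.start()

    def wait(self):
        # Called by the scheduler before the update phase begins.
        self._copy_done.wait()

# Usage: snapshot during forward/backward, wait before the optimizer update.
written = []
ckpt = AsyncCheckpoint()
ckpt.snapshot([[1.0, 2.0]], flush_fn=written.append)
ckpt.wait()  # safe to mutate the original tensors from here on
```

The scheduler-independent variant mentioned above would simply call `wait()` from a hook fired at the start of the update phase instead of from the pipeline scheduler.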
We appreciate your feedback and look forward to a collaboration in this space.
Hi @mauryaavinash95 - could you please run the pre-commit formatter? That should fix the formatting errors at least.
Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.
Thanks @mauryaavinash95 - The formatting checks look good, DCO shows failing. You can rebase to fix with the command here or if that might cause issues given the complex git history here, we can manually approve the DCO check if you let us know.
I tried following the DCO instructions, and this is what I see in my git log:
commit 1c701d7c61b170eea81dcc637379500a7586b9b2 (HEAD -> dev, origin/dev)
Author: Avinash <[email protected]>
Date:   Mon Mar 24 14:45:11 2025 -0500

    Fix formatting issues for DataStates-LLM

    Signed-off-by: Avinash Maurya <[email protected]>
And I think it would be very helpful if you can manually approve the DCO using my email as [email protected].
Based on the checks now it looks like only the DCO part is pending @loadams. Please let me know if there's anything I can do to fix this quicker than the DeepSpeed team manually approving the DCO.
@mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?
@saforem2, FYI
@tjruwase @saforem2 : yes, we'd like to set up a tutorial for this. Currently, there is just a short snippet to enable it in deepspeed/runtime/checkpoint_engine/README.md. Could you please point us to a reference and repository that we can use for the tutorial?
@mauryaavinash95, DeepSpeed tutorials appear on deepspeed.ai:
- Listed here: https://www.deepspeed.ai/tutorials/
- The docs are edited here: https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials
@tjruwase I've added the preserves_storage_sharing function for the checkpointing engine; fixed the unwanted commit in deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py; and uploaded a tutorial for using DataStates-LLM with DeepSpeed. Commit: 09858a7. Please let me know what you think.
@tjruwase: I have one more question about the way the latest checkpoint version is tracked in DeepSpeed engine.py and the Megatron-DeepSpeed engine.
Currently, both assume that checkpoints are synchronously flushed to stable storage by the time the function returns, and they immediately update the tracking files for the latest version. However, this assumption doesn't hold for asynchronous checkpointing, where flushes to slower tiers may still be in progress after the function exits.
Do you have thoughts on how best to handle this? One idea could be to move this responsibility into the checkpointing engine itself, allowing it to manage the timing and semantics of when the latest marker is updated.
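To make the idea concrete, here is a minimal sketch of deferring the `latest` marker until the flush completes. The helper name is hypothetical and is not a DeepSpeed API; it only illustrates the ordering constraint:

```python
import os
import tempfile
import threading
import time

def write_latest_after_flush(flush_thread, ckpt_dir, tag):
    """Hypothetical helper: update the `latest` tracking file only after the
    asynchronous flush has completed, so the marker never points at a
    checkpoint that is still in flight."""
    def _commit():
        flush_thread.join()  # wait until checkpoint data reaches storage
        with open(os.path.join(ckpt_dir, "latest"), "w") as f:
            f.write(tag)
    committer = threading.Thread(target=_commit)
    committer.start()
    return committer

# Demo: a slow stand-in "flush" followed by the deferred latest update.
ckpt_dir = tempfile.mkdtemp()
flush = threading.Thread(target=time.sleep, args=(0.1,))
flush.start()
committer = write_latest_after_flush(flush, ckpt_dir, "global_step100")
committer.join()
```

Moving this responsibility into the checkpointing engine would let each engine choose when (and from which thread) the marker write happens.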
@mauryaavinash95, good question. We handle this in our upcoming FastPersist code release. The idea is to add a bool decoupled() API to the checkpoint engine, where decoupled means asynchronous. For decoupled engines, the logic that commits checkpoints, including writing latest, runs in engine.step() before the optimizer step is called. Coupled engines use the existing logic. If you are not blocked on this, we can revisit sometime next week, when our PR is available, to align the APIs.
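A minimal sketch of the coupled/decoupled split described above. All names here are illustrative assumptions, not the actual FastPersist or DeepSpeed API:

```python
class CheckpointEngine:
    """Illustrative base class for the coupled/decoupled distinction."""

    def decoupled(self) -> bool:
        return False  # coupled engines commit synchronously inside save()

class DecoupledCheckpointEngine(CheckpointEngine):
    def __init__(self):
        self.pending = []    # tags of checkpoints still being flushed
        self.committed = []

    def decoupled(self) -> bool:
        return True

    def save(self, state, tag):
        # Flushing would proceed asynchronously; only the tag is recorded here.
        self.pending.append(tag)

    def commit(self):
        # Invoked from engine.step() before the optimizer step: finalize
        # in-flight checkpoints, then write the `latest` marker.
        self.committed.extend(self.pending)
        self.pending = []

def training_step(ckpt_engine, optimizer_step):
    # Decoupled engines commit earlier checkpoints before the weights change.
    if ckpt_engine.decoupled():
        ckpt_engine.commit()
    optimizer_step()

engine = DecoupledCheckpointEngine()
engine.save(state={}, tag="global_step50")
training_step(engine, optimizer_step=lambda: None)
```

Committing before the optimizer step guarantees the `latest` marker is only advanced once the checkpointed weights can no longer be overwritten in place.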
@tjruwase Thanks for the feedback. I've updated the PR per our discussion and moved the tensor-debloating logic into the checkpointing engines.
We can revisit the bool decoupled() API next week once the FastPersist engine PR is in place.
@mauryaavinash95, can you please look into the CI failures?
Also, it seems we are unable to update the branch.
@tjruwase Thanks for letting me know. I'll resync with the latest master branch and update the PR within a week; hopefully that resolves the CI failures. We don't yet have any DataStates-LLM-specific unit tests, so the checkpointing engine shouldn't cause failures in any other tests, right?
@mauryaavinash95 - is this ready to be merged?
@loadams: I think it is ready to be merged. The one pending thing we have is bool decoupled() API for asynchronous commit, which @tjruwase said we can discuss when the FastPersist engine PR is in place.
@mauryaavinash95 apologies for the delay on this. Since the FastPersist PR has been merged, do you want to resume this integration? Thanks!
@sfc-gh-truwase, sorry for the delay on our end too; I didn't get notified of this message. We've pushed an updated version based on the decoupled checkpointing engine. It also includes the changes to the preserves_storage_sharing setting that we discussed previously.
Hi @sfc-gh-truwase, just wanted to check if you had any feedback on this PR.
@mauryaavinash95 I will take a close look next week! Thanks for the reminder.
@mauryaavinash95, I approved with a suggested change.
@sfc-gh-truwase, thanks for the update. I've made a minor change in exporting the datastates engine, and it looks good from our end.
@mauryaavinash95 can you rebase the branch? For some reason, I am unable to do that.
Hi @sfc-gh-truwase, I've rebased it; can you please check again? Thanks.