DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

SCR checkpoint engine

Open adammoody opened this issue 1 year ago • 2 comments

The existing checkpoint engine model is a close fit for plugging in the Scalable Checkpoint / Restart library, which like Nebula, caches checkpoint files in node-local storage. There are two extensions that could be added that would help. I wanted to open this PR as a means for discussion.

  1. Delegate checkpoint directory creation to CheckpoinEngine

Could we move directory creation into the checkpoint engine, e.g., maybe add a new checkpoint_engine.makedirs() that could fill in for os.makedirs() calls like this one?

https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/runtime/pipe/module.py#L590

In some situations, it's useful to delay directory creation until the files are copied. In particular, some checkpoint libraries do not copy every checkpoint, in which case, it's good to avoid creating directories at all.

I opened a separate PR regarding this first request: https://github.com/microsoft/DeepSpeed/pull/2988

  1. Define new open/close calls in CheckpointEngine for reading a checkpoint

Just as the checkpoint sequence has "start" and "end" operations in the form of create() and commit(), it would be useful to have start and end operations for the restart sequence. Maybe new functions like checkpoint_engine.open() and checkpoint_engine.close()?

One would then call checkpoint_engine.load() in between open() and close(). This makes it easier for the library to allow the application to read from temporary paths. For the existing engines, one could move some of the logic from load() to open(). In the case of SCR, one must read all checkpoint files between calls to scr.start_restart() and scr.complete_restart().

A PR adding open and close can be previewed at https://github.com/microsoft/DeepSpeed/pull/3009

With the above two extensions to the CheckpointEngine, this current PR reduces to something like https://github.com/adammoody/DeepSpeed/pull/5 so that SCR-specific code is essentially fully contained to its engine file.

adammoody avatar Mar 08 '23 22:03 adammoody

Hi @adammoody - apologies for the very slow reply on this. Is this a PR that you would still like to see merged? If so, would you be able to - when you have time - merge the latest master and then we can review it?

Let us know, if not we can close this, but we will get the right people to review it after it is updated. Thanks!

loadams avatar Oct 09 '23 19:10 loadams

Thanks, @loadams . Actually, before getting into details on this PR, there is a related PR to add "open/close" operations to the checkpoint engine:

https://github.com/microsoft/DeepSpeed/pull/3009

Could we start with that one?

It also needs to be refreshed, but it would be help if folks could consider the ideas there and let me know whether the approach sounds reasonable.

adammoody avatar Oct 10 '23 22:10 adammoody