
Nvme offload checkpoint

eisene opened this issue 1 year ago

The previous PR #4416 had too many issues, so I'm closing that one and re-opening here. This PR includes a passing test.

This is a proposed implementation of model checkpointing when training with ZeRO-3 and NVMe offload:

  1. Currently, the names of the NVMe offload files (which the checkpoint has to reference) are based on the Python id of the parameter object, which is just the parameter's address in memory. This is not stable across runs, which has two disadvantages:

    • The NVMe offload files grow with every run of the model, even if the architecture did not change. This wastes disk space and, at least for me, was a surprise when I first saw it. This problem exists independently of checkpointing.
    • Without a way to match a file to its offloaded tensor, we cannot reload the checkpoint.

    We propose an alternative naming scheme: parameters are named after their ds_id instead of their Python id, and the offloaded tensors are named after their state_name and the (new) parameter id. See the sketches after this list.

  2. A model checkpoint now has to include all the offloaded tensor files. During checkpoint save/load we copy all the tensor files to/from the "offloaded_tensors" subdirectory of the checkpoint (sketched below). Because these files can be large and accumulate with every checkpoint, we log the remaining space on the file system. We do not copy the gradient files.

  3. When loading the checkpoint, the optimizer has already prepared buffers for swapping. We need to purge them so that they are replaced by the freshly copied on-disk buffers from the checkpoint.
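To make point 1 concrete, here is a minimal sketch of the old and proposed naming schemes. The function names and the exact file-name pattern are illustrative only, not the ones used inside DeepSpeed's swap code:

```python
import os

def swap_file_name_old(param, swap_dir):
    # Old scheme: id(param) is the object's address in memory, so it changes
    # from run to run and a swap file cannot be matched back to its parameter.
    return os.path.join(swap_dir, f"param_{id(param)}.tensor.swp")

def swap_file_name_new(ds_id, state_name, swap_dir):
    # Proposed scheme: ds_id is the integer ZeRO-3 assigns to each partitioned
    # parameter in module order, so the same parameter maps to the same file in
    # every run; offloaded tensors add their state_name (e.g. "exp_avg" for an
    # Adam optimizer state).
    return os.path.join(swap_dir, f"{state_name}_{ds_id}.tensor.swp")
```

Point 2 amounts to copying the swap files into the checkpoint directory and logging free space. A rough save-side sketch, again with hypothetical names and a hypothetical gradient-file filter:

```python
import os
import shutil
import logging

logger = logging.getLogger(__name__)

def copy_offloaded_tensors(swap_dir, checkpoint_dir):
    # Copy NVMe-offloaded tensor files into the checkpoint's
    # "offloaded_tensors" subdirectory, skipping gradient files.
    dest = os.path.join(checkpoint_dir, "offloaded_tensors")
    os.makedirs(dest, exist_ok=True)
    for name in os.listdir(swap_dir):
        if "gradient" in name:  # gradient files are not copied
            continue
        shutil.copy2(os.path.join(swap_dir, name), os.path.join(dest, name))
    # These files can be large and accumulate with every checkpoint, so report
    # the remaining space on the target file system.
    free_gb = shutil.disk_usage(checkpoint_dir).free / 2**30
    logger.info("Free space after copying offloaded tensors: %.1f GB", free_gb)
```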

The key differences between this PR and the previous one:

  • There's a test for a simple model with parameter/optimizer offload set to cpu/cpu, cpu/nvme and nvme/nvme (an example config for the last combination is sketched after this list).
  • Gradient files are not copied.
  • FP16 and FP32 parameter buffers are handled correctly during load.
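For reference, the nvme/nvme combination from the test is expressed in the DeepSpeed config roughly like this; the nvme_path value is a placeholder for a local SSD mount:

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",            # "cpu" for the cpu/* variants
            "nvme_path": "/local_nvme",  # placeholder path
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
    },
}
```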

eisene · Nov 20 '23