
[FSDP] Support optimizer state checkpointing

Open carmocca opened this issue 1 year ago • 0 comments

🚀 Feature

Motivation

Saving the optimizer state is critical to resume a training run.
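
As a toy illustration of why this matters, here is a minimal sketch in plain Python, with no Thunder or PyTorch involved. It runs SGD with momentum on a one-parameter quadratic loss and compares resuming with and without the saved momentum buffer. The helper names (`sgd_momentum_step`, `train`), the loss, and the hyperparameters are all made up for the example.

```python
def sgd_momentum_step(w, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step on the toy loss f(w) = w**2 (gradient 2*w)."""
    grad = 2.0 * w
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity


def train(w, velocity, steps):
    """Run `steps` updates and return the final weight and optimizer state."""
    for _ in range(steps):
        w, velocity = sgd_momentum_step(w, velocity)
    return w, velocity


# Reference run: 10 uninterrupted steps.
w_full, _ = train(1.0, 0.0, 10)

# Interrupted run: stop after 5 steps and "checkpoint" weight and velocity.
w_ckpt, v_ckpt = train(1.0, 0.0, 5)

# Resuming with the saved optimizer state reproduces the reference run.
w_resumed, _ = train(w_ckpt, v_ckpt, 5)

# Resuming with only the weight (velocity reset to zero) does not.
w_reset, _ = train(w_ckpt, 0.0, 5)

resumed_matches = w_resumed == w_full
reset_matches = w_reset == w_full
print(f"with optimizer state: {resumed_matches}, without: {reset_matches}")
```

The same effect applies to the per-parameter buffers of optimizers such as Adam; under FSDP those buffers are additionally sharded across ranks, which is why dedicated helpers are requested here.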

Pitch

from thunder.distributed.checkpoint import get_optimizer_state_dict, load_optimizer_state_dict

These functions would live in https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/distributed/checkpoint.py

Then, integrate them into Fabric's Thunder FSDP strategy.

References

get_optimizer_state_dict: https://github.com/pytorch/pytorch/blob/ee557d8f61bbe8a54742a82507a6edb2de3e5a89/torch/distributed/checkpoint/state_dict.py#L662
_optim_state_dict: https://github.com/pytorch/pytorch/blob/ee557d8f61bbe8a54742a82507a6edb2de3e5a89/torch/distributed/fsdp/_optim_utils.py#L1864

Additional context

Follow-up to Lightning-AI/lit-thunder-LEGACY#1909

cc @carmocca @awaelchli @crcrpar

carmocca, Mar 05 '24 13:03