lightning-thunder
[FSDP] Support optimizer state checkpointing
🚀 Feature
Motivation
Saving the optimizer state is critical for resuming a training run: without it, a resumed run loses the optimizer's accumulated statistics (e.g. Adam's momentum and variance buffers and step counts) and effectively restarts optimization from scratch.
Pitch
Expose `get_optimizer_state_dict` and `load_optimizer_state_dict` from `thunder.distributed.checkpoint` (https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/distributed/checkpoint.py), then integrate them into Fabric's Thunder FSDP strategy.
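For context, this is the round trip the feature needs to support. A minimal single-process sketch in plain PyTorch (no Thunder, no sharding) shows what must survive a checkpoint; with FSDP, the save side additionally has to gather the sharded per-parameter optimizer state across ranks, which is what the proposed `get_optimizer_state_dict` would handle:

```python
import io
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one step so the optimizer accumulates state (exp_avg, exp_avg_sq, step).
model(torch.randn(2, 4)).sum().backward()
optimizer.step()

# Save model AND optimizer state. Under FSDP this is the point where a
# gather/reshard is required, because each rank only holds its shard.
buffer = io.BytesIO()
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    buffer,
)

# Resume: rebuild the objects, then restore both state dicts.
buffer.seek(0)
checkpoint = torch.load(buffer, weights_only=False)
model2 = torch.nn.Linear(4, 4)
optimizer2 = torch.optim.AdamW(model2.parameters(), lr=1e-3)
model2.load_state_dict(checkpoint["model"])
optimizer2.load_state_dict(checkpoint["optimizer"])
```

After loading, `optimizer2` carries the same per-parameter state (including the step count) as the original, so training continues where it left off rather than re-warming the moment estimates.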
References
get_optimizer_state_dict: https://github.com/pytorch/pytorch/blob/ee557d8f61bbe8a54742a82507a6edb2de3e5a89/torch/distributed/checkpoint/state_dict.py#L662
_optim_state_dict: https://github.com/pytorch/pytorch/blob/ee557d8f61bbe8a54742a82507a6edb2de3e5a89/torch/distributed/fsdp/_optim_utils.py#L1864
Additional context
Follow-up to Lightning-AI/lit-thunder-LEGACY#1909
cc @carmocca @awaelchli @crcrpar