[FSDPCheckpoint] refactor: enhance FSDP checkpoint manager flexibility
Checklist Before Starting
- [x] Search for similar PR(s).
What does this PR do?
Add one-line overview of what this PR aims to achieve or accomplish.
This PR enables FSDPCheckpointManager to accept optimizer and lr_scheduler as None, removing some existing TODO. Now FSDPCheckpointManager performs saving and loading according to checkpoint_contents, only saving/loading content in checkpoint_contents. This behavior is consistent with MegatronCheckpointManager.
When allowing optimizer and lr_scheduler to be None, we can create an FSDPCheckpointManager for fsdp_module when FSDPWorkers are initialized only for rollout (is_actor==False and is_rollout==True). This allows users to use main_generation.py to directly load FSDP checkpoints without merging them into hf_model.
Also, added save_xx property in the base class to replace all "xx" in checkpoint_contents statements, making the code look better.
High-Level Design
Demonstrate the high-level design if this PR is complex.
Specific Changes
List the specific changes.
API
Demonstrate how the API changes if any.
Usage Example
Provide usage example(s) for easier usage.
# Add code snippet or script demonstrating how to use this
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc.
Currently CI should test this PR correctly.
Additional Info.
- Issue Number: Fixes issue # or discussion # if any.
- Training: FSDP
- Inference: VLLM
Checklist Before Submitting
- [x] Read the Contribute Guide.
- [x] Apply pre-commit checks.
- [x] Add
[BREAKING]to the PR title if it breaks any API. - [x] Update the documentation about your changes in the docs.
- [x] Add CI test(s) if neccessary.
Hi @vermouth1992, could u help review this PR?
Hi @0x404 , thanks for your efforts~
Could you consider my suggestion and resolve conflicts?
hi @ETOgaosion, no problem! Quite busy recently, I would revise this as soon as as possible:)
Hi all, Sorry for the late update. I think this PR is ready for review, and I will add some unit tests tomorrow. Hi @ETOgaosion, Should we trigger the CI first to see if we are breaking anything?
@0x404 I add some modification to docs, loggings and saving logic, please review~
@0x404 please double check if there are any missing mistakes?