verl icon indicating copy to clipboard operation
verl copied to clipboard

[FSDPCheckpoint] refactor: enhance FSDP checkpoint manager flexibility

Open 0x404 opened this issue 8 months ago • 3 comments

Checklist Before Starting

  • [x] Search for similar PR(s).

What does this PR do?

Add one-line overview of what this PR aims to achieve or accomplish.

This PR enables FSDPCheckpointManager to accept optimizer and lr_scheduler as None, removing some existing TODO. Now FSDPCheckpointManager performs saving and loading according to checkpoint_contents, only saving/loading content in checkpoint_contents. This behavior is consistent with MegatronCheckpointManager.

When allowing optimizer and lr_scheduler to be None, we can create an FSDPCheckpointManager for fsdp_module when FSDPWorkers are initialized only for rollout (is_actor==False and is_rollout==True). This allows users to use main_generation.py to directly load FSDP checkpoints without merging them into hf_model.

Also, added save_xx property in the base class to replace all "xx" in checkpoint_contents statements, making the code look better.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc.

Currently CI should test this PR correctly.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: FSDP
  • Inference: VLLM

Checklist Before Submitting

  • [x] Read the Contribute Guide.
  • [x] Apply pre-commit checks.
  • [x] Add [BREAKING] to the PR title if it breaks any API.
  • [x] Update the documentation about your changes in the docs.
  • [x] Add CI test(s) if neccessary.

0x404 avatar May 01 '25 14:05 0x404

Hi @vermouth1992, could u help review this PR?

0x404 avatar May 09 '25 10:05 0x404

Hi @0x404 , thanks for your efforts~

Could you consider my suggestion and resolve conflicts?

ETOgaosion avatar May 27 '25 05:05 ETOgaosion

hi @ETOgaosion, no problem! Quite busy recently, I would revise this as soon as as possible:)

0x404 avatar May 27 '25 05:05 0x404

Hi all, Sorry for the late update. I think this PR is ready for review, and I will add some unit tests tomorrow. Hi @ETOgaosion, Should we trigger the CI first to see if we are breaking anything?

0x404 avatar Jun 09 '25 14:06 0x404

@0x404 I add some modification to docs, loggings and saving logic, please review~

ETOgaosion avatar Jun 11 '25 04:06 ETOgaosion

@0x404 please double check if there are any missing mistakes?

ETOgaosion avatar Jun 11 '25 11:06 ETOgaosion