
[bugfix] Force create checkpoint directory before saving dataloader state.

Open · rj42 opened this pull request 7 months ago · 5 comments

Fix training crash due to missing checkpoint directory

We encountered a training crash with the error: "RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist".

It appears that self.actor_rollout_wg.save_checkpoint, which should create the checkpoint directory, may be running asynchronously and does not finish creating the folder in time.

This change explicitly forces creation of the directory before saving the dataloader state to prevent this race condition.
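A minimal sketch of the idea (the function name and the data.pt file name below are illustrative, not the exact verl code):

import os

import torch


def save_dataloader_state(dataloader_state: dict, checkpoint_dir: str) -> None:
    # Force-create the step directory on the driver; exist_ok=True keeps this
    # safe even if a worker already created it on a shared filesystem.
    os.makedirs(checkpoint_dir, exist_ok=True)
    torch.save(dataloader_state, os.path.join(checkpoint_dir, "data.pt"))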

Checklist Before Starting

  • [x] Search for similar PR(s).

What does this PR do?

Force creation of the checkpoint directory before saving the dataloader state, so training no longer crashes when the directory does not yet exist.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

Explicitly create the checkpoint step directory in the trainer before saving the dataloader state.

API

No API changes.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: 1657
  • Training: FSDP/Megatron
  • Inference: vLLM

Checklist Before Submitting

  • [x] Read the Contribute Guide.
  • [x] Apply pre-commit checks.
  • [x] Add [BREAKING] to the PR title if it breaks any API.
  • [x] Update the documentation about your changes in the docs.
  • [x] Add CI test(s) if necessary.

rj42 · May 21 '25 18:05

Hi @rj42, thanks for joining the development~

Can you provide more info about your running scripts and environment, and whether you are using FSDP or Megatron?

There are actually two cases when saving checkpoints:

  1. To HDFS or NFS systems shared by all ranks
  2. To local storage of each machine

So your solution works for the first case but has no effect in the second. And there is no need to use the thread-safe mkdir in BaseCheckpointManager, since ray_trainer is the single controller.

Could you add a print statement after the local_mkdir call in each model's save_checkpoint to locate the bug more precisely?

You are free to open an issue first for this problem~

ETOgaosion · May 23 '25 03:05

Hi @ETOgaosion,

Let me provide more details about our setup:

  1. scripts: We're running vanilla DAPO on Qwen2.5-32B with our own data, using the script at https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
  2. environment: Standard environment from verl/trainer/runtime_env
  3. parallelism: We're using FSDP

"And there is no need to use the thread-safe mkdir in BaseCheckpointManager, as the ray_trainer is the single controller."

You're right that ray_trainer is a single controller, but the issue we're seeing is that the workers behind self.actor_rollout_wg.save_checkpoint can try to create the same folder on the same machine simultaneously, causing a race condition.
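To make the concern concrete, here is a small sketch of the check-then-create pattern versus a race-tolerant one (the path is taken from the log below; everything else is illustrative):

import os

step_dir = "/workspace/ckpts/global_step_20"

# Racy: another rank on the same node can create the directory between the
# check and the mkdir, which then raises FileExistsError.
if not os.path.exists(step_dir):
    os.mkdir(step_dir)

# Race-tolerant: concurrent creation becomes a no-op.
os.makedirs(step_dir, exist_ok=True)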

"Actually you can add a print sentence after the local_mkdir function in both model's save_checkpoint and try to locate the bug more precisely?"

We already have logs that implicitly show the issue (see my annotations marked with <--):

...
Tue May 20 21:16:06 2025 [1,0]<stdout>: RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist.   <-- failed (controller)
Tue May 20 21:16:06 2025 [1,0]<stdout>: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025 [1,0]<stdout>: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025 [1,0]<stdout>: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster]   <-- see here (https://github.com/volcengine/verl/blob/9ddc72520eda03b1a397c1f3249d4788a79b7b4d/verl/utils/checkpoint/fsdp_checkpoint_manager.py#L149)

Would you like me to open an issue first to document this problem before proceeding with the fix?

rj42 · May 23 '25 09:05

Hi @ETOgaosion, please check the PR.

rj42 · May 28 '25 07:05

https://github.com/volcengine/verl/blob/bb4f97b7541c52936844a4522d6bb6d8570c2c10/verl/trainer/ppo/ray_trainer.py#L846-L856

I checked the code to determine why this error occurred. In my run, L846 executes successfully: it runs on each worker, and the directory is created automatically by FSDPCheckpointManager. However, L855 fails because it runs on the head node, where the directory has not been created.

By the way, these files are distributed across temporary directories on different nodes (e.g., ray/session_2025-05-28_11-37-15_954169_4804/runtime_resources/working_dir_files/_ray_pkg_4db734b20e952364/), and it's challenging to merge them together. Should I use HDFS?
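For context, a rough sketch of the save flow described above (method and attribute names are paraphrased from this thread, not the exact code at the linked lines):

import os

import torch


def save_checkpoint_step(trainer, step_dir: str) -> None:
    # Runs on the workers: each rank's checkpoint manager creates
    # step_dir/actor on that worker's filesystem before writing its shards.
    trainer.actor_rollout_wg.save_checkpoint(os.path.join(step_dir, "actor"))

    # Runs on the head node: with node-local storage, step_dir may not exist
    # here, so torch.save fails unless the directory is created first.
    os.makedirs(step_dir, exist_ok=True)  # the fix proposed in this PR
    torch.save(trainer.train_dataloader.state_dict(),
               os.path.join(step_dir, "data.pt"))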

DoubleVII · May 29 '25 04:05

@DoubleVII Thanks for the additional information; I see what you mean. It does not seem like a good idea to split checkpoint saving across both the head and worker nodes.

I agree with @rj42; we may need to consider the case where users start the head node on a CPU-only machine and store checkpoints locally.

ETOgaosion · May 29 '25 10:05

Hi @ETOgaosion, it's been two weeks since this PR was opened: what's the plan for merging it? It seems like it might be stuck. On my end, I can add that my team is already using this fix successfully.

rj42 · Jun 03 '25 13:06

Thanks for the reminder~ Last week our CI systems were under heavy load and ran into trouble; we will merge once CI finishes~

ETOgaosion · Jun 05 '25 06:06