[bugfix] Force-create checkpoint directory before saving dataloader state.
Fix training crash due to missing checkpoint directory
We encountered a training crash with the error: "RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist".
It appears that self.actor_rollout_wg.save_checkpoint, which should create the checkpoint directory, may run asynchronously and not finish creating the folder in time.
This change explicitly forces creation of the directory before saving the dataloader state to prevent this race condition.
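A minimal sketch of the idea, for illustration only (the helper name, the `checkpoint_root`/`global_steps` arguments, and the `data.pt` filename are assumptions here, not the exact identifiers in the trainer):

```python
import os

import torch


def save_dataloader_state(checkpoint_root: str, global_steps: int, dataloader_state: dict) -> str:
    """Persist the dataloader state under the per-step checkpoint folder,
    force-creating the folder first so the save does not depend on the
    worker-side save_checkpoint call having already created it."""
    step_folder = os.path.join(checkpoint_root, f"global_step_{global_steps}")
    os.makedirs(step_folder, exist_ok=True)  # the fix: create the directory before saving

    path = os.path.join(step_folder, "data.pt")  # filename is illustrative
    torch.save(dataloader_state, path)
    return path


# Example: this save previously crashed with "Parent directory ... does not exist"
# when the step folder had not yet been created on the node running the trainer.
save_dataloader_state("/tmp/ckpts_demo", 20, {"sampler_state": {}, "epoch": 0})
```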
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Force-creates the checkpoint directory before saving the dataloader state, so training no longer crashes with "RuntimeError: Parent directory ... does not exist" when the directory has not been created yet.

### High-Level Design

Demonstrate the high-level design if this PR is complex.

### Specific Changes

- Explicitly create the checkpoint step directory in the trainer before saving the dataloader state.

### API

Demonstrate how the API changes if any.

### Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this

### Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- Issue Number: 1657
- Training: FSDP/Megatron
- Inference: vLLM

### Checklist Before Submitting

- [x] Read the Contribute Guide.
- [x] Apply pre-commit checks.
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the docs.
- [x] Add CI test(s) if necessary.
Hi @rj42, thanks for joining the development~
Can you provide more info about your running scripts and environment, and whether you are using FSDP or Megatron?
There are actually 2 cases when saving ckpts:
- To HDFS or NFS systems shared by all ranks
- To local storage of each machine
So your solution works for the first case but has no effect in the second case. Also, there is no need for a thread-safe mkdir in BaseCheckpointManager, as the ray_trainer is the single controller.
Could you add a print statement after the local_mkdir call in both models' save_checkpoint to locate the bug more precisely?
You are free to open an issue first for this problem~
Hi @ETOgaosion,
Let me provide more details about our setup:
- scripts: We're running vanilla DAPO on Qwen2.5-32B with our own data, using the script at https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
- environment: Standard environment from verl/trainer/runtime_env
- parallelism: We're using FSDP
"And there is no need to use the thread-safe mkdir in BaseCheckpointManager, as the ray_trainer is the single controller."
Regarding this point: you're right that ray_trainer is a single controller, but the issue we're seeing is that the workers behind self.actor_rollout_wg.save_checkpoint can try to create the same folder on the same machine at the same time, causing race conditions.
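To illustrate the kind of race I mean, here is a standalone sketch (not verl code; the path is made up): directory creation has to be idempotent when several ranks on one machine hit it at once, and `os.makedirs(..., exist_ok=True)` stays safe in that situation, whereas a bare `os.mkdir` would raise `FileExistsError` for the losers of the race.

```python
import multiprocessing as mp
import os

CKPT_DIR = "/tmp/ckpts_demo/global_step_20"  # illustrative path


def make_ckpt_dir(path: str) -> None:
    # Safe even if another process creates the directory first.
    os.makedirs(path, exist_ok=True)


if __name__ == "__main__":
    # Simulate several ranks on one machine racing to create the same folder.
    procs = [mp.Process(target=make_ckpt_dir, args=(CKPT_DIR,)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert os.path.isdir(CKPT_DIR)
```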
"Actually you can add a print sentence after the local_mkdir function in both model's save_checkpoint and try to locate the bug more precisely?"
We already have implicit prints showing the issue (see the lines annotated with <--):
...
Tue May 20 21:16:06 2025[1,0]<stdout>:RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist. <-- failed (controller)
Tue May 20 21:16:06 2025[1,0]<stdout>:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]<stdout>:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]<stdout>:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster] <-- see here (https://github.com/volcengine/verl/blob/9ddc72520eda03b1a397c1f3249d4788a79b7b4d/verl/utils/checkpoint/fsdp_checkpoint_manager.py#L149)
Would you like me to open an issue first to document this problem before proceeding with the fix?
Hi @ETOgaosion, please check the PR.
https://github.com/volcengine/verl/blob/bb4f97b7541c52936844a4522d6bb6d8570c2c10/verl/trainer/ppo/ray_trainer.py#L846-L856
I checked the code to determine why this error occurs. In my run, L846 executes successfully: it runs on each worker, and the directory is automatically created by FSDPCheckpointManager. However, L855 fails because it runs on the head node, where the directory has not been created.
By the way, these files are distributed across temporary directories on different nodes (e.g., ray/session_2025-05-28_11-37-15_954169_4804/runtime_resources/working_dir_files/_ray_pkg_4db734b20e952364/), and it's challenging to merge them together. Should I use HDFS?
@DoubleVII Thanks for the additional information. I see what you mean. It doesn't seem like a good idea to spread checkpoint saving across both the head and worker nodes.
I agree with @rj42; we may need to consider the case where users start the head node on a CPU-only machine and use local checkpoint storage.
Hi @ETOgaosion, it's been two weeks since this PR was opened: what's the plan for merging it? It seems like it might be stuck. On my end, I can add that my team is already using this fix successfully.
Thanks for the reminder~ Last week our CI systems were heavily loaded and ran into trouble; we will merge once CI finishes~