Race condition when creating checkpoint directories causes training failures
Problem
We've encountered a race condition when creating checkpoint directories during DAPO training that causes the process to crash with the following error:
RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist.
Environment
- Model: Qwen2.5-32B
- Training Method: DAPO (vanilla)
- Parallelism: FSDP
- Script used: https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
- Environment: Standard from verl/trainer/runtime_env
Error logs
Note my comments marked with <--.
Tue May 20 21:16:06 2025 [1,0]: RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist. <-- failed (controller)
Tue May 20 21:16:06 2025 [1,0]: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025 [1,0]: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025 [1,0]: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster]
Possible cause
self.actor_rollout_wg.save_checkpoint creates the checkpoint folder asynchronously on the worker side, so the directory may not exist yet when the controller tries to save its own data into it.
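For context, here is a minimal sketch of the pattern that can trigger this. The function and attribute names (trainer, dataloader_state, the data.pt filename) are illustrative assumptions, not verl's exact code:

```python
import os
import torch

def save_checkpoint_step(trainer, global_step: int):
    """Sketch of the racy flow; names are illustrative, not verl's exact API."""
    step_dir = f"/workspace/ckpts/global_step_{global_step}"

    # The workers are asked to save; each rank creates step_dir/actor/ itself,
    # and the call can overlap with (or return before) that directory creation.
    trainer.actor_rollout_wg.save_checkpoint(step_dir)

    # The controller then writes its own state under the same directory.
    # If no worker has created step_dir yet, torch.save raises
    # "RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist."
    torch.save(trainer.dataloader_state, os.path.join(step_dir, "data.pt"))
```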
Proposed Solution
Have the controller create the directory explicitly in advance, before anything is saved into it.
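A minimal sketch of that fix, under the same illustrative assumptions as above (where exactly this call belongs in verl's trainer is an assumption):

```python
import os
import torch

def save_checkpoint_step(trainer, global_step: int):
    """Same sketch with the proposed fix: the controller creates the directory first."""
    step_dir = f"/workspace/ckpts/global_step_{global_step}"

    # Create the step directory up front on the controller; exist_ok=True keeps this
    # safe if a worker has already created it on the shared filesystem.
    os.makedirs(step_dir, exist_ok=True)

    trainer.actor_rollout_wg.save_checkpoint(step_dir)
    torch.save(trainer.dataloader_state, os.path.join(step_dir, "data.pt"))
```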
PR link
I'm hitting the same problem.
Same issue
Same issue as well