Race condition when creating checkpoint directories causes training failures
Problem
We've encountered a race condition when creating checkpoint directories during DAPO training that causes the process to crash with the following error:
RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist.
Environment
- Model: Qwen2.5-32B
- Training Method: DAPO (vanilla)
- Parallelism: FSDP
- Script used: https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
- Environment: Standard from verl/trainer/runtime_env
Error logs
Note my comments marked with <--.
Tue May 20 21:16:06 2025 [1,0]: RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist. <-- failed (controller)
Tue May 20 21:16:06 2025 [1,0]: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025 [1,0]: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025 [1,0]: (WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster]
Possible cause
self.actor_rollout_wg.save_checkpoint creates the checkpoint folder asynchronously on the worker side, so the directory may not exist yet when the controller tries to save its own data into it.
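For context, here is a minimal sketch of the pattern that can trigger this. The function and attribute names (trainer, dataloader_state, the data.pt filename) are illustrative assumptions, not verl's exact code:

```python
import os
import torch

def save_checkpoint_step(trainer, global_step: int):
    """Sketch of the racy flow; names are illustrative, not verl's exact API."""
    step_dir = f"/workspace/ckpts/global_step_{global_step}"

    # The workers are asked to save; each rank creates step_dir/actor/ itself,
    # and the call can overlap with (or return before) that directory creation.
    trainer.actor_rollout_wg.save_checkpoint(step_dir)

    # The controller then writes its own state under the same directory.
    # If no worker has created step_dir yet, torch.save raises
    # "RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist."
    torch.save(trainer.dataloader_state, os.path.join(step_dir, "data.pt"))
```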
Proposed Solution
Have the controller create the directory explicitly in advance, before anything is saved into it.
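A minimal sketch of that fix, under the same illustrative assumptions as above (where exactly this call belongs in verl's trainer is an assumption):

```python
import os
import torch

def save_checkpoint_step(trainer, global_step: int):
    """Same sketch with the proposed fix: the controller creates the directory first."""
    step_dir = f"/workspace/ckpts/global_step_{global_step}"

    # Create the step directory up front on the controller; exist_ok=True keeps this
    # safe if a worker has already created it on the shared filesystem.
    os.makedirs(step_dir, exist_ok=True)

    trainer.actor_rollout_wg.save_checkpoint(step_dir)
    torch.save(trainer.dataloader_state, os.path.join(step_dir, "data.pt"))
```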
PR link
I'm hitting the same problem.
Same issue
Same issue as well