Race condition when creating checkpoint directories causes training failures

Open rj42 opened this issue 7 months ago • 1 comments

Problem

We've encountered a race condition when creating checkpoint directories during DaPo training that causes the process to crash with the following error:

RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist.

Environment

  • Model: Qwen2.5-32B
  • Training Method: DaPo (vanilla)
  • Parallelism: FSDP
  • Script used: https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
  • Environment: Standard from verl/trainer/runtime_env

Error logs

See my comments marked with <--.

Tue May 20 21:16:06 2025[1,0]:RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist. <-- failed (controller)
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster]

Possible cause

self.actor_rollout_wg.save_checkpoint creates the checkpoint directory asynchronously, and the creation does not always finish before the controller tries to save its own data into that directory.
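The suspected ordering can be illustrated with a small standard-library sketch. This is not verl code: `simulate_race`, the `worker` thread, and the `data.pt` file name are hypothetical stand-ins, and the sketch forces the unlucky ordering with an event instead of relying on timing. It also raises Python's `FileNotFoundError` rather than the `RuntimeError` seen in the log.

```python
import os
import threading


def simulate_race(root: str) -> str:
    """Return "missing" if the controller writes before the directory exists."""
    step_dir = os.path.join(root, "global_step_20")
    controller_done = threading.Event()

    def worker() -> None:
        # Stand-in for the worker group: it creates the directory,
        # but only after the controller has already tried to write.
        controller_done.wait()
        os.makedirs(os.path.join(step_dir, "actor"), exist_ok=True)

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        # Stand-in for the controller: it assumes step_dir already exists.
        with open(os.path.join(step_dir, "data.pt"), "wb") as f:
            f.write(b"state")
        outcome = "ok"
    except FileNotFoundError:
        outcome = "missing"
    finally:
        controller_done.set()
        thread.join()
    return outcome
```

Calling `simulate_race` on an empty temporary directory returns "missing", which mirrors the controller failing while the workers are still about to create the folder.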

Proposed Solution

Explicitly create the directory before saving, instead of relying on the worker group to have created it.
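A minimal sketch of that idea, using only the Python standard library. The helper names `ensure_parent_dir` and `save_bytes` are hypothetical and are not part of verl; the actual change in the linked PR may look different. The key point is that `os.makedirs(..., exist_ok=True)` is safe to call from several processes, because it does not fail when another process has already created the directory.

```python
import os


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of `path` if it is missing, then return `path`."""
    parent = os.path.dirname(path) or "."
    # exist_ok=True makes this idempotent, so concurrent callers do not collide.
    os.makedirs(parent, exist_ok=True)
    return path


def save_bytes(path: str, payload: bytes) -> None:
    """Hypothetical stand-in for the controller's save step."""
    with open(ensure_parent_dir(path), "wb") as f:
        f.write(payload)
```

With this in place, the controller no longer depends on who creates `global_step_20` first: whichever side arrives first creates the directory, and the other side finds it already there.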

PR link

rj42 avatar May 23 '25 09:05 rj42

I ran into the same problem.

DoubleVII avatar May 28 '25 06:05 DoubleVII

Same issue

rinapch avatar Jun 02 '25 21:06 rinapch

same issue as well

EdanToledo avatar Aug 11 '25 19:08 EdanToledo