
accelerator.wait_for_everyone() gets stuck

Open xin-ran-w opened this issue 6 months ago • 2 comments

I am training UNO on my own dataset. When it saves a checkpoint, it gets stuck.


if accelerator.sync_gradients and accelerator.is_main_process and global_step % args.checkpointing_steps == 0:
        logger.info(f"saving checkpoint in {global_step=}")
        save_path = os.path.join(args.project_dir, f"checkpoint-{global_step}")
        os.makedirs(save_path, exist_ok=True)

        # save
        accelerator.wait_for_everyone()   # this step gets stuck
        unwrapped_model = accelerator.unwrap_model(dit)
        unwrapped_model_state = unwrapped_model.state_dict()
        requires_grad_key = [k for k, v in unwrapped_model.named_parameters() if v.requires_grad]
        unwrapped_model_state = {k: unwrapped_model_state[k] for k in requires_grad_key}

xin-ran-w avatar Jun 04 '25 12:06 xin-ran-w

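This seems to be caused by a conflict between accelerator.is_main_process and accelerator.wait_for_everyone(): the barrier sits inside the is_main_process branch, so only the main process reaches it, and it waits forever for the other processes, which never make the call.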
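
A minimal sketch of one possible fix, assuming the rest of the training loop is unchanged: call wait_for_everyone() on every process, and keep only the file writing inside the is_main_process check. The helper name maybe_save_checkpoint and the file name trainable.bin are hypothetical and are not taken from the UNO code. In sharded setups, gathering the state dict may itself need all ranks, which this sketch does not handle.

```python
import os


def maybe_save_checkpoint(accelerator, model, global_step, checkpointing_steps, project_dir):
    """Save only the trainable weights every `checkpointing_steps` steps.

    Hypothetical helper for illustration; not part of UNO or Accelerate.
    """
    # No is_main_process here: every rank evaluates the same condition,
    # so every rank enters (or skips) the checkpoint branch together.
    if not (accelerator.sync_gradients and global_step % checkpointing_steps == 0):
        return None

    # Barrier reached by ALL ranks, so nobody waits for a call that never comes.
    accelerator.wait_for_everyone()

    save_path = os.path.join(project_dir, f"checkpoint-{global_step}")
    if accelerator.is_main_process:
        # Only the main process touches the filesystem.
        os.makedirs(save_path, exist_ok=True)
        unwrapped = accelerator.unwrap_model(model)
        state = unwrapped.state_dict()
        trainable = {k for k, p in unwrapped.named_parameters() if p.requires_grad}
        accelerator.save(
            {k: v for k, v in state.items() if k in trainable},
            os.path.join(save_path, "trainable.bin"),  # assumed file name
        )

    # Second barrier so the other ranks do not run ahead of the write.
    accelerator.wait_for_everyone()
    return save_path
```

In the training loop this would be called on every process after each optimizer step, for example maybe_save_checkpoint(accelerator, dit, global_step, args.checkpointing_steps, args.project_dir).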

xin-ran-w avatar Jun 04 '25 16:06 xin-ran-w

This seems to be caused by a conflict between accelerator.is_main_process and accelerator.wait_for_everyone()

May I ask if you have solved this problem? I also have a similar problem.

python-doggg avatar Jun 28 '25 01:06 python-doggg