UNO
accelerator.wait_for_everyone() gets stuck
I am training UNO on my own dataset. When it saves a checkpoint, training hangs.
if accelerator.sync_gradients and accelerator.is_main_process and global_step % args.checkpointing_steps == 0:
    logger.info(f"saving checkpoint in {global_step=}")
    save_path = os.path.join(args.project_dir, f"checkpoint-{global_step}")
    os.makedirs(save_path, exist_ok=True)
    # save
    accelerator.wait_for_everyone()  # this call hangs
    unwrapped_model = accelerator.unwrap_model(dit)
    unwrapped_model_state = unwrapped_model.state_dict()
    # keep only the parameters that are being trained
    requires_grad_key = [k for k, v in unwrapped_model.named_parameters() if v.requires_grad]
    unwrapped_model_state = {k: unwrapped_model_state[k] for k in requires_grad_key}
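This seems to be caused by calling accelerator.wait_for_everyone() inside the accelerator.is_main_process branch. wait_for_everyone() is a barrier that every process has to reach, but here only the main process enters the branch, so it waits forever for the other ranks.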
May I ask if you have solved this problem? I also have a similar problem.
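One possible workaround, offered as a sketch rather than as the repository's official fix: call the barrier on every rank and do only the file writing on the main process. The helper name save_trainable_checkpoint and the file name trainable_params.bin are made up for illustration. The sketch assumes a plain multi-GPU (DDP) run; with sharded setups such as DeepSpeed ZeRO-3 or FSDP, collecting the state dict may itself need all ranks.

```python
import os


def save_trainable_checkpoint(accelerator, model, global_step, checkpointing_steps, project_dir):
    """Save only the trainable parameters. Every rank must call this function."""
    # Decide using values that are the same on every rank, so that all ranks
    # take the same branch and all of them reach the barrier below.
    if not (accelerator.sync_gradients and global_step % checkpointing_steps == 0):
        return None

    # Barrier OUTSIDE the is_main_process guard: every rank arrives here.
    accelerator.wait_for_everyone()

    save_path = None
    if accelerator.is_main_process:
        save_path = os.path.join(project_dir, f"checkpoint-{global_step}")
        os.makedirs(save_path, exist_ok=True)
        unwrapped = accelerator.unwrap_model(model)
        state = unwrapped.state_dict()
        # Keep only the parameters that are being trained.
        trainable = [k for k, v in unwrapped.named_parameters() if v.requires_grad]
        # "trainable_params.bin" is a hypothetical file name.
        accelerator.save(
            {k: state[k] for k in trainable},
            os.path.join(save_path, "trainable_params.bin"),
        )

    # Second barrier so the other ranks do not run ahead while rank 0 writes.
    accelerator.wait_for_everyone()
    return save_path
```

In the training loop this would be called unconditionally on all ranks, for example save_trainable_checkpoint(accelerator, dit, global_step, args.checkpointing_steps, args.project_dir), instead of being wrapped in an is_main_process check.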