save model error
The program waits a long time before saving the model at line 426. If I comment out this line,
accelerator.wait_for_everyone()
the program continues to run, but the saved dit_lora parameters are empty.
+1
It still gets stuck when saving the model with multiple GPUs.
I tried putting accelerator.wait_for_everyone() before if accelerator.is_main_process, and it works.
accelerator.wait_for_everyone() can't be placed inside the accelerator.is_main_process block. You should change the training code like this:
if accelerator.sync_gradients and global_step % args.checkpointing_steps == 0:
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        logger.info(f"saving checkpoint in {global_step=}")
        save_path = os.path.join(args.project_dir, f"checkpoint-{global_step}")
        os.makedirs(save_path, exist_ok=True)
        ...
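For anyone wondering why the placement matters: wait_for_everyone() is a barrier, so every process has to reach it before any of them can continue. If the call sits inside the is_main_process branch, only rank 0 ever reaches it and it waits for ranks that never arrive. Below is a rough sketch of that behavior. It does not use accelerate at all: threads and threading.Barrier stand in for the GPU processes, and run_ranks is a made-up helper just for illustration. A timeout is used so the "stuck" case is reported instead of hanging forever.

```python
import threading


def run_ranks(num_ranks, barrier_inside_main_only, timeout=0.5):
    """Simulate `num_ranks` workers and report whether each got past the barrier.

    Hypothetical helper: threads stand in for distributed processes and
    threading.Barrier stands in for accelerator.wait_for_everyone().
    """
    barrier = threading.Barrier(num_ranks)
    results = {}

    def worker(rank):
        is_main = rank == 0
        try:
            if barrier_inside_main_only:
                # Problematic layout: only the main rank calls the barrier.
                if is_main:
                    barrier.wait(timeout=timeout)
            else:
                # Suggested layout: every rank calls the barrier first.
                barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # The barrier timed out because the other ranks never arrived.
            results[rank] = "stuck"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("barrier inside main-process branch:", run_ranks(4, True))
    print("barrier before main-process branch:", run_ranks(4, False))
```

In the first layout rank 0 is reported as stuck while the other ranks run ahead; in the second layout all ranks pass the barrier together, and only then would the main process go on to save.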
It still doesn't work.
Comment out accelerator.wait_for_everyone() and revise the code here:
if accelerator.sync_gradients and global_step % args.checkpointing_steps == 0:
    if accelerator.is_main_process: