save model error

Open trikim opened this issue 8 months ago • 6 comments

It takes a long time before the model is saved at line 426. If I comment out this line (accelerator.wait_for_everyone()), the program continues to run, but the saved dit_lora parameters are empty.

trikim avatar Apr 17 '25 06:04 trikim

+1

buaawangyu avatar Apr 17 '25 06:04 buaawangyu

It still gets stuck when saving the model with multiple GPUs.

Unc1eW4ng avatar Apr 20 '25 23:04 Unc1eW4ng

I tried putting accelerator.wait_for_everyone() before the if accelerator.is_main_process check, and it works.
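A likely explanation is that wait_for_everyone() acts as a barrier that every process has to reach. The sketch below illustrates this with threading.Barrier from the standard library as a stand-in for the real multi-GPU barrier. The function name run_ranks and its parameters are hypothetical and are not taken from the training script.

```python
import threading


def run_ranks(world_size, barrier_inside_main_only, timeout=0.5):
    """Simulate `world_size` ranks meeting at a barrier.

    Returns True if every rank that called the barrier got past it,
    False if any of them timed out (the simulated hang).
    """
    barrier = threading.Barrier(world_size)
    results = []
    lock = threading.Lock()

    def worker(rank):
        is_main = rank == 0
        if barrier_inside_main_only and not is_main:
            # Non-main ranks skip the guarded block, so they never
            # reach the barrier that rank 0 is waiting on.
            return
        try:
            barrier.wait(timeout=timeout)
            passed = True
        except threading.BrokenBarrierError:
            # The wait timed out: the other ranks never arrived.
            passed = False
        with lock:
            results.append(passed)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


# Barrier reached by every rank: everyone proceeds.
print(run_ranks(world_size=2, barrier_inside_main_only=False))
# Barrier reached only by rank 0: rank 0 waits until the timeout.
print(run_ranks(world_size=2, barrier_inside_main_only=True))
```

In the real training run there is no timeout in this position, so the main process would simply keep waiting, which matches the hang described above.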

buaawangyu avatar Apr 21 '25 02:04 buaawangyu

accelerator.wait_for_everyone() must not be placed inside the if accelerator.is_main_process block, because then only the main process reaches it. You should change the training code like this:

        if accelerator.sync_gradients and global_step % args.checkpointing_steps == 0:
            accelerator.wait_for_everyone()
            if accelerator.is_main_process:

                logger.info(f"saving checkpoint in {global_step=}")
                save_path = os.path.join(args.project_dir, f"checkpoint-{global_step}")
                os.makedirs(save_path, exist_ok=True)
...
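For reference, the same pattern can be wrapped in a small helper. This is only a sketch: the function name save_checkpoint_if_due and the save_fn callback are hypothetical, the part elided by "..." above is left to save_fn, and the second wait_for_everyone() call is an optional addition that does not appear in this thread.

```python
import os


def save_checkpoint_if_due(accelerator, global_step, checkpointing_steps,
                           project_dir, save_fn):
    """Save a checkpoint on the main process, with the barrier outside the guard.

    `accelerator` is expected to provide `sync_gradients`, `is_main_process`
    and `wait_for_everyone()`. `save_fn(save_path)` is a hypothetical callback
    that writes the actual weights.
    """
    if not (accelerator.sync_gradients and global_step % checkpointing_steps == 0):
        return None

    # Every process calls the barrier, not just the main one.
    accelerator.wait_for_everyone()

    save_path = None
    if accelerator.is_main_process:
        save_path = os.path.join(project_dir, f"checkpoint-{global_step}")
        os.makedirs(save_path, exist_ok=True)
        save_fn(save_path)

    # Assumption (not from the thread): keep the other processes from
    # running ahead while the main process is still writing.
    accelerator.wait_for_everyone()
    return save_path
```

Because the condition on global_step is the same on every process, all of them enter the function body together and reach both barriers.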

luyuhua avatar Apr 22 '25 07:04 luyuhua

This still does not work for me.

Chalet37 avatar Apr 23 '25 08:04 Chalet37

Comment out accelerator.wait_for_everyone() and revise the code here:

        if accelerator.sync_gradients and global_step % args.checkpointing_steps == 0:
            if accelerator.is_main_process:

scnuhealthy avatar May 21 '25 07:05 scnuhealthy