
How to resume my previous training?

pierreparfait01 opened this issue 2 years ago · 3 comments

From the documentation I assumed I should add --resume before rerunning my training, but when I start it, it just prints that it is training from scratch.

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME}
    --config=${CONFIG}
    --show_pbar --resume

How exactly do I resume my training? Or is it supposed to say "Train from scratch"?

pierreparfait01 · Sep 23, 2023

Hi @pierreparfait01, has a checkpoint ever been saved? Just adding --resume should make the training resume from the latest checkpoint. Otherwise, you could specify --checkpoint={CHECKPOINT_PATH} as mentioned in the README.
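
For example, resuming from a specific checkpoint would look something like this (a sketch based on the command above; ${CHECKPOINT_PATH} is a placeholder for the path to an existing checkpoint file, as in the README):

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --checkpoint=${CHECKPOINT_PATH} \
    --show_pbar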

chenhsuanlin · Sep 25, 2023

Yes, a checkpoint has been saved, but it still trains from scratch every single time. Even when I use --checkpoint={CHECKPOINT_PATH} it still trains from scratch; I checked, and it does indeed train all over again.

Edit: I managed to get it working. For some reason the checkpoint wouldn't load if the flag was placed after --show_pbar. But here comes another issue (bug?): if I place --resume before --show_pbar, then --show_pbar doesn't take effect.

pierreparfait01 · Sep 25, 2023

Hi, I encountered the same problem and figured out why: you need to add a \ at the end of the last line before adding --resume on a new line.
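
For reference, with the continuation character added, the full command would look something like this (same variables as in the original command):

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar \
    --resume

Without the trailing \ on each line, the shell ends the command early and the remaining flags (including --resume) are never passed to train.py, which is why training starts from scratch.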

Jaydentlee · Apr 2, 2024