neuralangelo
How to resume my previous training?
From the documentation I assumed I should add --resume before rerunning my training, but after I start it, it just trains from scratch:
torchrun --nproc_per_node=${GPUS} train.py \
--logdir=logs/${GROUP}/${NAME}
--config=${CONFIG}
--show_pbar
--resume
How exactly do I resume my training? Or is it supposed to say "train from scratch"?
Hi @pierreparfait01, has a checkpoint ever been saved? Just adding --resume should make the training resume from the latest checkpoint. Otherwise, you could specify --checkpoint={CHECKPOINT_PATH} as mentioned in the README.
Yes, a checkpoint has been saved, but it still trains from scratch every single time. Even when I use --checkpoint={CHECKPOINT_PATH} it still trains from scratch; I tested it, and it is indeed training all over again.
Edit: I managed to get it to work. For some reason the checkpoint wouldn't load if the flag was placed after --show_pbar. But this raises another issue (bug?): if I place --resume before --show_pbar, then --show_pbar doesn't take effect.
Hi, I encountered the same problem as yours and figured out why: you need to add a \ at the end of the last line before you add a new line with --resume.
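To illustrate the mechanism: without a trailing backslash, the shell ends the command at that line and treats the following lines as separate commands, so train.py never receives the remaining flags. A minimal sketch of the behavior, using echo as a stand-in for torchrun (the flag names here are just illustrative):

```shell
# With a trailing backslash on each continuation line, the shell joins
# everything into ONE command, so every flag reaches the program:
echo --config=cfg \
    --show_pbar \
    --resume
# prints: --config=cfg --show_pbar --resume

# Without the backslash, the command would end at that line, and the shell
# would try to run the next line (e.g. `--resume`) as a separate command,
# failing with "command not found" -- the flag never reaches the program.
```

Applied to the command in this thread, that means adding ` \` to the end of the --logdir, --config, and --show_pbar lines, so that --resume (and every other flag) stays part of the single torchrun invocation.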