
about resume training

Open amaze567 opened this issue 3 years ago • 9 comments

Hello, I trained a yolor-p6 model on my dataset for 1000 epochs. However, when I tried to fine-tune the network by loading the 300-epoch weights, training started from epoch zero. Is this normal, or did the old weights simply not load? And how can I tell whether the old weights were loaded successfully?

amaze567 avatar Dec 28 '21 04:12 amaze567

As long as you load the "checkpoint.pt" file as your weights, you should be fine. The epoch counter refers to the training run you are currently launching, so it is normal for it to start from zero.

Wazaki-Ou avatar Dec 28 '21 10:12 Wazaki-Ou

@Wazaki-Ou Thanks for your reply. I still have one question, though. The loss at the starting epoch of the fine-tuning run is 0.1604, but the checkpoint I loaded had already been trained down to a loss of 0.02xx. Shouldn't the two be the same, or at least not differ by much? That's why I am wondering whether the program actually loaded the checkpoint file.

amaze567 avatar Dec 28 '21 15:12 amaze567

@amaze567 To be honest, I'm not sure whether that is incorrect behavior. I hope someone with a better understanding of how resume works can help.

Wazaki-Ou avatar Dec 29 '21 07:12 Wazaki-Ou

@Wazaki-Ou OK. Still thanks for your reply. :)

amaze567 avatar Dec 29 '21 08:12 amaze567

@amaze567 I think I have the same issue: the checkpoint did not actually load, so it is training as if with `--weights ''`. Have you also run into a case where, after reloading the checkpoint .pt, the epochs do not start at 0? I seem to be hitting that when I load my checkpoint.

Wilbertbh-Tan avatar Jan 03 '22 22:01 Wilbertbh-Tan

@Wilbertbh-Tan Hi, I am still facing the same issue. I have tried reloading the old weights many times, but training still starts from epoch zero. Have you made any progress on it?

amaze567 avatar Jan 10 '22 07:01 amaze567

@amaze567 Yes. First, make sure the path to your weights is correct. If it isn't, training will start from scratch. When you resume training by running train.py, it should pick up from where you left off. To change this for fine-tuning, I edited the train.py script and set the starting epoch to the value I wanted.

In your case, I'm not sure whether the weights are being initialized from scratch or training is actually resuming. Can you verify this by checking the log?

Wilbertbh-Tan avatar Jan 27 '22 05:01 Wilbertbh-Tan
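
One way to check what a checkpoint would let training resume from is to inspect the dict stored in the .pt file. The sketch below is only an illustration: the key names `epoch`, `model`, and `optimizer` follow the common YOLOv5-style layout and are an assumption here, and `summarize_checkpoint` is a hypothetical helper, not part of train.py. Plain dicts stand in for real checkpoints so the example runs without PyTorch.

```python
def summarize_checkpoint(ckpt):
    """Report what a checkpoint dict contains (key names are assumptions)."""
    epoch = ckpt.get("epoch")
    if epoch is None:
        epoch = -1  # treat a missing epoch like "nothing to resume from"
    return {
        "has_model": ckpt.get("model") is not None,
        "has_optimizer": ckpt.get("optimizer") is not None,
        "saved_epoch": epoch,
        # With no usable saved epoch, a training script would typically count from 0.
        "expected_start_epoch": epoch + 1 if epoch >= 0 else 0,
    }


# In practice the dict would come from something like:
#   ckpt = torch.load(path_to_weights, map_location="cpu")
# The two dicts below are made-up stand-ins for illustration.
mid_run = {"epoch": 299, "model": "weights", "optimizer": {"state": {}}}
finished = {"epoch": -1, "model": "weights", "optimizer": None}

print(summarize_checkpoint(mid_run))
print(summarize_checkpoint(finished))
```

If the real file turns out to have no optimizer state and no non-negative epoch, a counter that starts at zero does not by itself show that the weights failed to load.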

I'd like to ask: the learning rate changes when training is resumed. How can I make sure it continues from the previous learning rate?

qutyyds avatar Jul 10 '22 11:07 qutyyds

It uses the epoch to look up the corresponding learning rate from the schedule.

WongKinYiu avatar Jul 10 '22 13:07 WongKinYiu
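
The idea of looking up the learning rate by epoch can be sketched as follows. This is a minimal illustration, not the repository's code: the cosine shape, `base_lr`, `final_factor`, and the function name `lr_at_epoch` are all assumptions chosen for the example. The point is only that when the learning rate is a pure function of the epoch index, a run resumed at epoch N gets the same value an uninterrupted run would have used at epoch N.

```python
import math


def lr_at_epoch(epoch, total_epochs, base_lr=0.01, final_factor=0.2):
    """Hypothetical cosine schedule: the learning rate depends only on the epoch."""
    cosine = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return base_lr * (final_factor + (1 - final_factor) * cosine)


total_epochs = 1000

# Learning rates an uninterrupted run would use, epoch by epoch.
uninterrupted = [lr_at_epoch(e, total_epochs) for e in range(total_epochs)]

# A resumed run only needs the saved epoch index to recover its learning rate.
start_epoch = 300
resumed_lr = lr_at_epoch(start_epoch, total_epochs)

print(resumed_lr == uninterrupted[start_epoch])
```

With a scheduler object instead of a bare function, the same effect is usually obtained by telling the scheduler which epoch it is at before training continues.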