about resume training
Hello, I have trained a yolor-p6 model on my dataset for 1000 epochs. However, when I tried to fine-tune the network and loaded the 300-epoch weights, it started training from epoch zero. Is that normal, or did I simply fail to load the old weights? And how can I tell whether the old weights were loaded successfully?
As long as you load the "checkpoint.pt" file as your weights, it should be fine. The epoch counter refers to the training run you are currently launching, so it's normal that it starts from zero.
@Wazaki-Ou Thanks for your reply, although I still have a question. The loss at the starting epoch of fine-tuning is 0.1604, but the checkpoint I loaded had already been trained down to a loss of 0.02xx. Shouldn't they be the same, or at least not differ by much? That's why I am wondering whether the program actually loaded the checkpoint file.
@amaze567 I'm not sure whether that's incorrect behavior, to be honest. I hope someone with a better understanding of how resuming works can help.
@Wazaki-Ou OK. Thanks anyway for your reply. :)
@amaze567 I think I have the same issue: the checkpoint did not actually load, so it is effectively training with `--weights ''`.
Have you faced any issue where, after reloading the checkpoint .pt file, the epochs do not start at 0? I seem to be having this issue when I load my checkpoint.
@Wilbertbh-Tan Hi, I am still facing the same issue. I have tried reloading the old weights many times, but training still starts from epoch zero. Have you made any progress on it?
@amaze567 Yes. First, make sure the path to your weights is correct; if it isn't, the model will train from scratch. When you resume training by running train.py, it should continue from where you left off. To change this for fine-tuning, I edited the train.py script to set the starting epoch to the value I wanted, as in the sketch below.
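For reference, here is a minimal sketch of the resume logic found in YOLOv5-style train.py scripts, which yolor follows. The checkpoint keys (`epoch`, `model`, `optimizer`) and the path below are assumptions; verify them against your copy of train.py:

```python
# Sketch of the start-epoch logic in a YOLOv5-style train.py (assumed layout).
import torch

weights = 'runs/train/exp/weights/last.pt'  # hypothetical checkpoint path

start_epoch = 0
if weights:  # train.py only loads a checkpoint when --weights is non-empty
    ckpt = torch.load(weights, map_location='cpu')
    # model.load_state_dict(ckpt['model'].float().state_dict())  # restore weights
    # optimizer.load_state_dict(ckpt['optimizer'])               # restore optimizer state
    if ckpt.get('epoch') is not None:
        start_epoch = ckpt['epoch'] + 1  # continue one past the saved epoch

# To fine-tune with a fresh epoch counter instead of resuming, force it back:
# start_epoch = 0
print(f'starting at epoch {start_epoch}')
```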
I'm not sure in your case whether the weight file is being created from scratch or whether training is actually resuming. Can you verify this by checking the log?
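Another quick way to verify is to inspect the checkpoint directly before training. A minimal sketch (the filename `last.pt` is a placeholder, and the `epoch` key follows the YOLOv5-style convention yolor uses):

```python
# Print what the checkpoint actually stores before resuming from it.
import torch

ckpt = torch.load('last.pt', map_location='cpu')
print('keys:', list(ckpt.keys()))          # e.g. epoch / best_fitness / model / optimizer
print('saved epoch:', ckpt.get('epoch'))   # -1 typically indicates a stripped final checkpoint
```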
I'd like to ask: the learning rate changes when training is resumed. How can I make sure it continues from the previous learning rate?
The epoch is used to look up the corresponding learning rate from the schedule.
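In other words, the scheduler is a function of the epoch index, so fast-forwarding it to the resumed epoch reproduces the learning rate the interrupted run would have used. A minimal sketch, assuming a YOLOv5-style cosine schedule; the hyperparameters (`epochs`, `lrf`, base LR) here are illustrative, not yolor's actual defaults:

```python
# Fast-forward an LR schedule to the resumed epoch (assumed cosine schedule).
import math
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

epochs, lrf = 300, 0.2                   # assumed total epochs and final LR fraction
model = torch.nn.Linear(10, 10)          # stand-in for the real network
optimizer = SGD(model.parameters(), lr=0.01)

# cosine decay from 1.0 down to lrf over `epochs`
lf = lambda x: ((1 + math.cos(x * math.pi / epochs)) / 2) * (1 - lrf) + lrf
scheduler = LambdaLR(optimizer, lr_lambda=lf)

start_epoch = 100                        # e.g. read from ckpt['epoch'] + 1
scheduler.last_epoch = start_epoch - 1   # fast-forward the schedule
scheduler.step()                         # LR now matches epoch `start_epoch`
print(optimizer.param_groups[0]['lr'])
```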