
Loss Increases When Resuming From the Previous Trained Model

Open GaoXinJian-USTC opened this issue 3 years ago • 9 comments

I trained the model for 4 epochs but found the learning rate inappropriate. So I changed the learning rate schedule and resumed training from the checkpoint saved after epoch 3. I noticed that the loss at the start of the resumed run is higher than the loss at the end of epoch 3. The same problem appears in the log of your official ABINet training. Why does this happen, and does it affect the accuracy of the models?
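One common cause of such a jump is a resume that restores the model weights but not the optimizer state. A minimal pure-Python sketch (illustrative only; the numbers are made up and this is not MMOCR's actual resume logic) shows how dropping the optimizer state changes the very next update:

```python
# Hypothetical sketch: SGD with momentum, resumed two ways.
# Restoring only the weights silently resets the momentum buffer,
# so the first post-resume update differs from uninterrupted training.
def sgd_momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """One SGD-with-momentum update; returns (new_w, new_velocity)."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Take one training step, then checkpoint BOTH weight and velocity.
w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, grad=0.5, velocity=v)   # "epoch 3" step
ckpt = {"w": w, "v": v}

# Full resume: restore every piece of state.
full_resume, _ = sgd_momentum_step(ckpt["w"], grad=0.5, velocity=ckpt["v"])

# Weights-only resume: velocity starts back at 0.
partial_resume, _ = sgd_momentum_step(ckpt["w"], grad=0.5, velocity=0.0)

# The two resumed runs diverge immediately.
```

The same reasoning applies to Adam's moment estimates and to the LR scheduler's step counter: any state left out of the checkpoint restarts cold on resume.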

GaoXinJian-USTC avatar Feb 28 '22 08:02 GaoXinJian-USTC

Thanks for catching that - to be honest, we hadn't noticed the loss change after resuming. We'll need some time to investigate this issue.

gaotongxiao avatar Feb 28 '22 11:02 gaotongxiao

@ZwwWayne do you have any clues?

gaotongxiao avatar Feb 28 '22 11:02 gaotongxiao

I also notice that the text recognition accuracy increases when resuming from a previous epoch without any change to the training config, even though the loss increases. It seems that resuming from a previous checkpoint may improve text recognition accuracy, but I am not sure whether this is a coincidence.

GaoXinJian-USTC avatar Mar 11 '22 01:03 GaoXinJian-USTC

Thanks for the update. We'll investigate this issue next week.

gaotongxiao avatar Mar 11 '22 02:03 gaotongxiao

Can you paste your training log, config, and running command here?

xinke-wang avatar Mar 11 '22 02:03 xinke-wang

Sorry, the log file is on an encrypted server and cannot be exported. I simply resumed from a previous checkpoint (epoch 3 of 6 in total) and continued training to the same total of 6 epochs. I then found that the loss increased compared with epoch 3, while the accuracy of the epoch 4, 5, and 6 models improved. It is not clear whether this is a coincidence.

GaoXinJian-USTC avatar Mar 11 '22 03:03 GaoXinJian-USTC

Thank you for the information. If possible, could you please provide the following details to help us investigate the issue:

  1. Did you modify the model structure or training config? If so, what changes did you make? If not, which model and config did you use?
  2. Did you train on a private dataset or on a public dataset supported by MMOCR?

xinke-wang avatar Mar 11 '22 03:03 xinke-wang

  1. I changed the vision model based on ABINet, using my custom backbone and parallel attention. Besides, I changed the learning rate schedule so that the learning rate drops by a factor of 10 from the fifth epoch, with an initial learning rate of 1e-4.
  2. I trained my tailored ABINet vision model only on the public MJ and ST datasets.
  3. The same bug, where the loss increases after resuming, also occurs in the log of your official ABINet training.
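The schedule described in item 1 can be sketched in pure Python (illustrative only; the function name and epoch indexing are assumptions, not MMOCR config syntax):

```python
# Hypothetical sketch of the described schedule: initial LR 1e-4,
# multiplied by 0.1 from the fifth epoch onward.
def lr_at_epoch(epoch, base_lr=1e-4, drop_epoch=5, gamma=0.1):
    """Return the learning rate used during `epoch` (1-indexed)."""
    return base_lr * gamma if epoch >= drop_epoch else base_lr

# LRs for a 6-epoch run: epochs 1-4 at 1e-4, epochs 5-6 at 1e-5.
schedule = [lr_at_epoch(e) for e in range(1, 7)]
```

Note that a resumed run only reproduces this schedule if the scheduler's epoch counter is restored along with the weights; otherwise the drop at epoch 5 lands at the wrong point.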

GaoXinJian-USTC avatar Mar 11 '22 03:03 GaoXinJian-USTC

Thank you. I will try to investigate this issue.

xinke-wang avatar Mar 11 '22 03:03 xinke-wang