
Resume checkpoint

Open JerryZeyu opened this issue 6 years ago • 6 comments

Why is it that when I resume a checkpoint to continue training, the loss looks normal but the test result of the new epoch is very low, like the first epoch's? For example, I trained the model for 10 epochs, then resumed the 10th checkpoint and continued training for an 11th epoch. During the 11th epoch the loss is normal and low, but when the 11th epoch finishes, its test result is very low, similar to the 1st epoch's result. Can you tell me the reason? Thank you very much.

JerryZeyu avatar Oct 21 '18 18:10 JerryZeyu

@JerryZeyu I also see this phenomenon. I think something in the optimizer may not be fully saved; however, I haven't figured out the reason.

BangLiu avatar Oct 23 '18 03:10 BangLiu

@BangLiu @JerryZeyu I figured it out! It is caused by EMA. The EMA is initialized in QANet_main.py before the model is resumed. The initialization should be moved to after the model-resuming operation, e.g. after #76 in QANet_trainer.py.
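For context, an EMA helper along these lines snapshots the current parameter values when register is called, which is why the order matters: registering before the checkpoint is loaded makes the shadow hold freshly initialized weights, and those are what get applied at test time. A minimal sketch (class and method names are illustrative, not necessarily the repo's exact implementation):

```python
import torch

class EMA:
    """Minimal exponential-moving-average helper (illustrative sketch)."""
    def __init__(self, decay=0.9999):
        self.decay = decay
        self.shadow = {}

    def register(self, name, val):
        # Snapshots the *current* value; if this runs before the checkpoint
        # is loaded, the shadow holds randomly initialized weights.
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        # Standard EMA update applied during training steps.
        new_average = self.decay * self.shadow[name] + (1.0 - self.decay) * x
        self.shadow[name] = new_average.clone()
        return new_average
```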

zhangchen010295 avatar Oct 29 '18 08:10 zhangchen010295

Do you mean that the EMA initialization shouldn't be in QANet_main.py at all, or just that self.ema = ema should be moved to after #76 in QANet_trainer.py? I also find that the scheduler influences this, because the scheduler needs to step() once per epoch and shouldn't step after every training step. A rough sketch of that placement is below. Thanks.
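A rough sketch of the stepping placement described above, i.e. calling scheduler.step() once per epoch rather than after every batch (the loop and variable names are placeholders, not the repo's actual trainer code):

```python
# Placeholder training loop; assumes `model(batch)` returns a scalar loss.
for epoch in range(start_epoch, num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)      # forward pass and loss computation
        loss.backward()
        optimizer.step()         # parameter update every batch
    scheduler.step()             # learning-rate update once per epoch
```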

JerryZeyu avatar Oct 29 '18 20:10 JerryZeyu

@JerryZeyu 'ema.register' should be called after 'self._resume_checkpoint(resume)'

zhangchen010295 avatar Oct 30 '18 01:10 zhangchen010295

I tried your method, but there are still some problems. For example, if I resume from epoch 10 and continue training, the performance of epochs 11, 12, 13, 14, ... stays the same as epoch 10 and doesn't improve. Can you tell me how to solve it? Thank you very much.

JerryZeyu avatar Oct 31 '18 03:10 JerryZeyu

@JerryZeyu It works for me by just moving the EMA registration from main.py to trainer.py:

```python
if resume:
    self._resume_checkpoint(resume)
    self.model = self.model.to(self.device)
    for state in self.optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(self.device)

# moved from main.py because the model may be resumed
if self.use_ema:
    for name, param in self.model.named_parameters():
        if param.requires_grad:
            self.ema.register(name, param.data)
```
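Note that re-registering the EMA from the loaded weights discards the previously accumulated averages. If the averaged weights themselves should survive a resume, the shadow values can also be saved and restored with the checkpoint; a minimal sketch, assuming the helper exposes a dict-like ema.shadow (the attribute and checkpoint keys here are hypothetical, adapt them to the actual trainer):

```python
# Sketch only: assumes an EMA helper with a dict-like `shadow` attribute
# (hypothetical); adapt names to the actual checkpoint format.

# When saving a checkpoint:
checkpoint = {
    'epoch': epoch,
    'model': self.model.state_dict(),
    'optimizer': self.optimizer.state_dict(),
    'ema_shadow': dict(self.ema.shadow),  # hypothetical attribute
}
torch.save(checkpoint, checkpoint_path)

# When resuming:
checkpoint = torch.load(checkpoint_path, map_location=self.device)
self.model.load_state_dict(checkpoint['model'])
self.optimizer.load_state_dict(checkpoint['optimizer'])
if self.use_ema and 'ema_shadow' in checkpoint:
    # Restore the accumulated averages instead of re-registering from scratch.
    self.ema.shadow.update(checkpoint['ema_shadow'])
```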

zhangchen010295 avatar Oct 31 '18 07:10 zhangchen010295