pytorch-deeplab-xception Resume Training Problem

Open herleeyandi opened this issue 5 years ago • 3 comments

Hello, I am trying to resume my training, but the loaded model does not seem to pick up where it left off. For example, I trained until epoch 50 and reached an mIoU of 0.63. Then I ran with --resume pointing at the last checkpoint. The checkpoint loaded successfully and training began from epoch 51 with the continued LR. However, for the first several epochs the mIoU is only 0.32 to 0.33. This is weird; the model should start at an mIoU of roughly 0.60 to 0.63.

Jan 14 '19 04:01 herleeyandi

I think this may be because you have not saved the optimizer parameters.

Jan 21 '19 02:01 stillwaterman

@stillwaterman What do you mean by that? The optimizer parameters should already be saved by this code, right? It is in train.py at lines 167-176; see especially the line 'optimizer': self.optimizer.state_dict(). Am I right?

new_pred = mIoU
if new_pred > self.best_pred:
    is_best = True
    self.best_pred = new_pred
    self.saver.save_checkpoint({
        'epoch': epoch + 1,
        'state_dict': self.model.module.state_dict(),
        'optimizer': self.optimizer.state_dict(),
        'best_pred': self.best_pred,
    }, is_best)

Jan 29 '19 13:01 herleeyandi
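
One way to test the "was everything saved?" question separately from the repository's training loop is to round-trip a small model through a checkpoint that uses the same four keys as the snippet above, then compare outputs in eval mode. The following is only a minimal sketch under stated assumptions: the toy network, the helper names write_checkpoint, read_checkpoint and make_model_and_optimizer, and the in-memory buffer are all hypothetical and are not the repository's saver or model. It also uses a plain model, so it calls model.state_dict() rather than model.module.state_dict().

```python
import io

import torch
import torch.nn as nn


def make_model_and_optimizer():
    # Hypothetical toy network; it includes BatchNorm so that running
    # statistics are part of the state that has to survive the round trip.
    model = nn.Sequential(
        nn.Conv2d(3, 4, kernel_size=3, padding=1),
        nn.BatchNorm2d(4),
        nn.ReLU(),
        nn.Conv2d(4, 2, kernel_size=1),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    return model, optimizer


def write_checkpoint(model, optimizer, epoch, best_pred, target):
    # Same keys as the snippet quoted in the thread.
    torch.save({
        'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'best_pred': best_pred,
    }, target)


def read_checkpoint(model, optimizer, source):
    # Restore both the model and the optimizer, then report where to resume.
    checkpoint = torch.load(source, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    return checkpoint['epoch'], checkpoint['best_pred']


torch.manual_seed(0)
model, optimizer = make_model_and_optimizer()
images = torch.randn(8, 3, 16, 16)
labels = torch.randint(0, 2, (8, 16, 16))

# A few training steps so BatchNorm statistics and momentum buffers are non-trivial.
model.train()
for _ in range(3):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()

# Save to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
write_checkpoint(model, optimizer, epoch=49, best_pred=0.63, target=buffer)
buffer.seek(0)

# Load into freshly constructed objects, as a resumed run would.
resumed_model, resumed_optimizer = make_model_and_optimizer()
start_epoch, best_pred = read_checkpoint(resumed_model, resumed_optimizer, buffer)

# Compare predictions in eval mode so BatchNorm uses its stored statistics.
model.eval()
resumed_model.eval()
with torch.no_grad():
    outputs_match = torch.allclose(model(images), resumed_model(images))

print(start_epoch, best_pred, outputs_match)
```

A check like this does not diagnose the drop described in the thread. It only shows what a clean round trip looks like, so that "what was saved" can be examined separately from "how it was loaded and evaluated".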

@herleeyandi Were you able to find out what the problem was? I'm facing the same issue.

Nov 07 '19 11:11 ashnair1
