pytorch-deeplab-xception

Resume Training Problem

Open herleeyandi opened this issue 5 years ago • 3 comments

Hello, I am trying to resume training, but the loaded model does not seem to pick up where it left off. For example, I trained until epoch 50 and reached an mIoU of 0.63. I then used --resume with the last checkpoint. The checkpoint loaded successfully and training continued from epoch 51 with the continued LR. However, for the first several epochs the mIoU is only 0.32 to 0.33. This is weird; the model should start at an mIoU of roughly 0.60 to 0.63.

herleeyandi, Jan 14 '19 04:01

I think this may be because you have not saved the optimizer parameters

stillwaterman, Jan 21 '19 02:01

@stillwaterman What do you mean by that? The optimizer parameters should already be saved by this code in train.py at lines 167-176, specifically 'optimizer': self.optimizer.state_dict(), am I right?

new_pred = mIoU
if new_pred > self.best_pred:
    is_best = True
    self.best_pred = new_pred
    self.saver.save_checkpoint({
        'epoch': epoch + 1,
        'state_dict': self.model.module.state_dict(),
        'optimizer': self.optimizer.state_dict(),
        'best_pred': self.best_pred,
    }, is_best)

herleeyandi, Jan 29 '19 13:01

@herleeyandi Were you able to find out what the problem was? I'm facing the same issue.

ashnair1, Nov 07 '19 11:11
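
The thread does not record a resolution. One way to test the optimizer-state hypothesis raised above is a save/load round trip on a toy model: save a checkpoint with the same dictionary keys as the quoted snippet, load it into a freshly built model and optimizer, and check that both the outputs and the momentum buffers match. The sketch below is only an illustration under stated assumptions: the tiny network, the SGD settings, the in-memory buffer, and the helper names make_model_and_optimizer and train_step are hypothetical and are not taken from the repository's train.py.

```python
import io

import torch
import torch.nn as nn


def make_model_and_optimizer(seed):
    # Hypothetical stand-in for the real network and optimizer.
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    return model, optimizer


def train_step(model, optimizer, x, y):
    # One plain gradient step on a mean-squared-error loss.
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
x, y = torch.randn(16, 4), torch.randn(16, 2)

# "Train" for a few steps so the optimizer accumulates momentum buffers.
model, optimizer = make_model_and_optimizer(seed=1)
for _ in range(5):
    train_step(model, optimizer, x, y)

# Save with the same keys as the snippet quoted in the thread.
buffer = io.BytesIO()
torch.save({
    'epoch': 5,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'best_pred': 0.0,
}, buffer)
buffer.seek(0)

# Resume into a differently initialised model and a fresh optimizer.
checkpoint = torch.load(buffer)
resumed_model, resumed_optimizer = make_model_and_optimizer(seed=2)
resumed_model.load_state_dict(checkpoint['state_dict'])
resumed_optimizer.load_state_dict(checkpoint['optimizer'])

# Check 1: the resumed model reproduces the original outputs.
model.eval()
resumed_model.eval()
with torch.no_grad():
    weights_match = torch.allclose(model(x), resumed_model(x))

# Check 2: the optimizer state (momentum buffers) survived the round trip.
original_state = optimizer.state_dict()['state']
resumed_state = resumed_optimizer.state_dict()['state']
momentum_match = len(original_state) > 0 and all(
    torch.equal(original_state[k]['momentum_buffer'],
                resumed_state[k]['momentum_buffer'])
    for k in original_state
)

print('weights match:', weights_match)
print('optimizer state match:', momentum_match)
```

If both checks pass on the real checkpoint as well, the saved weights and optimizer state are probably not the cause, and it may be worth comparing the validation mIoU immediately after loading (before any further training step) against the value stored under 'best_pred'.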