InsightFace_Pytorch: the code does not seem to support resuming training from saved weights?

I was training a model on CASIA-WebFace and it stopped accidentally partway through the epochs. So I added some lines to Learner.py and tried to resume training, but it failed. Here is my resuming code:

    def train(self, conf, epochs, resume=False, fixed_str=None):
        self.model.train()
        conf.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = nn.DataParallel(self.model)
        self.head = nn.DataParallel(self.head)
        self.model.cuda()
        self.head.cuda()
        start_epoch=0
        if resume==True:
            if not fixed_str:
                raise ValueError('must input fixed_str parameter!')
            self.load_state(conf,fixed_str)
            self.step = int(fixed_str.split('_')[-2].split(':')[1])+1
            start_epoch = self.step//len(self.loader)
            self.step = start_epoch*len(self.loader)+1
            print('loading model at epoch {} done!'.format(start_epoch))
            print(self.optimizer)
        running_loss = 0.
        dc_loss = 0.
        bceloss_func = nn.BCELoss()
        for e in range(start_epoch,epochs):
            print('epoch {} started'.format(e))
            if e == self.milestones[0]:
                self.schedule_lr()
            if e == self.milestones[1]:
                self.schedule_lr()
        #nothing changed below

I changed nothing below this point. The weird thing is that no matter which saved weights I load for the model, head, and optimizer, I get a very high CELoss when I continue training. I tested this in an IPython notebook: with a randomly initialized learner I get a CELoss of around 45, but when I load weights (which reach 93% accuracy on LFW) into the learner I get a CELoss of around 77. I think the problem lies in the logic of the Arcface class in Learner.py, but I am not sure. Could anyone help me figure out the issue?
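For reference, the step arithmetic from the train() snippet above can be checked in isolation. This is only a sketch: the function name `resume_position`, the example file name, and the value 3000 for `len(self.loader)` are hypothetical, and the name format is inferred from the parsing expression in the snippet rather than taken from the repository.

```python
def resume_position(fixed_str, steps_per_epoch):
    """Mirror the step arithmetic used in the train() snippet above."""
    # Assumed name format: '<timestamp>_accuracy:<acc>_step:<n>_<extra>'
    saved_step = int(fixed_str.split('_')[-2].split(':')[1])
    step = saved_step + 1
    # Epoch that the saved step falls into.
    start_epoch = step // steps_per_epoch
    # Rewind to the first step of that epoch, as the snippet does.
    step = start_epoch * steps_per_epoch + 1
    return start_epoch, step


# Hypothetical checkpoint name and loader length, for illustration only.
start_epoch, step = resume_position(
    '2019-01-14-06-01_accuracy:0.93_step:7500_None.pth', 3000)
print(start_epoch, step)
```

With these made-up numbers the run restarts at the beginning of epoch 2, so the part of that epoch completed before the checkpoint is repeated. Note also that the loop only calls schedule_lr() when `e` equals a milestone, so milestones earlier than `start_epoch` are not re-applied by the loop; the learning rate then depends on what the loaded optimizer state contains.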

Jan 14 '19 06:01 qq184861643

Resuming training works well for me. I didn't change anything in the train method, and I use the Arcface head.

Jan 23 '19 13:01 boomberung

@boomberung Thanks. Then maybe my problem lies in nn.DataParallel. I will try it later.
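Some background on the nn.DataParallel idea: the wrapper stores the original network as its `.module` attribute, so the keys of its state_dict carry a `module.` prefix, and load_state_dict with the default `strict=True` raises an error when keys do not match. Whether this is the cause here is not established in this thread. A minimal sketch of converting between the two key styles, using plain dictionaries with made-up key names instead of real tensors (the helper names are hypothetical):

```python
PREFIX = "module."


def strip_data_parallel_prefix(state_dict):
    """Return a copy whose keys have no leading 'module.' prefix."""
    return {
        (key[len(PREFIX):] if key.startswith(PREFIX) else key): value
        for key, value in state_dict.items()
    }


def add_data_parallel_prefix(state_dict):
    """Return a copy whose keys all start with 'module.'."""
    return {
        (key if key.startswith(PREFIX) else PREFIX + key): value
        for key, value in state_dict.items()
    }


# Placeholder values stand in for tensors; the key names are made up.
wrapped = {"module.conv1.weight": 1, "module.bn1.bias": 2}
plain = strip_data_parallel_prefix(wrapped)
print(sorted(plain))
```

The same helpers could be applied to a dictionary returned by torch.load before passing it to load_state_dict, depending on whether the target model is wrapped at that point.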

Jan 24 '19 04:01 qq184861643

@qq184861643 When I change ArcFace to my own head, I have the same issue as you. The loss with random initialization is ~45, but when I resume training from saved weights the loss starts from ~50 (and LFW accuracy is 94%).

Feb 05 '19 14:02 boomberung

Hi. This may be unrelated to your question, but I want to perform face verification, and the current ArcFace architecture does not perform well on my dataset. Is it possible to fine-tune the model with my custom dataset?

Thanks in advance

Feb 21 '19 13:02 DecentMakeover

@boomberung Hi! Have you figured out how to solve this? I've tried several methods but still can't fix it.

Mar 20 '19 03:03 qq184861643

@DecentMakeover If we can't solve the resuming issue, I don't think fine-tuning is possible.

Mar 20 '19 03:03 qq184861643

@qq184861643 No, but I found that even with this loss curve, the network learns normally. I think the problem is with this line: "loss_board = running_loss / self.board_loss_every"
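To illustrate how a fixed divisor can distort a logged average, here is a minimal sketch. It assumes the loss is summed over steps and then divided by a constant logging interval; the `RunningMean` class and all numbers are hypothetical, and this is not confirmed to be what happens in the repository.

```python
class RunningMean:
    """Average over the number of values actually accumulated."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def pop(self):
        # Return the mean of the accumulated values and reset.
        mean = self.total / self.count if self.count else 0.0
        self.total = 0.0
        self.count = 0
        return mean


board_loss_every = 100      # hypothetical logging interval
meter = RunningMean()
running_loss = 0.0
for _ in range(40):         # only 40 steps happen before the first log
    loss = 50.0             # made-up constant loss value
    running_loss += loss
    meter.update(loss)

fixed_divisor_value = running_loss / board_loss_every
counted_value = meter.pop()
print(fixed_divisor_value, counted_value)
```

In this made-up case the fixed divisor reports 20.0 while the per-step mean is 50.0, so the first logged point after a restart may not be comparable with later ones.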

Apr 02 '19 14:04 boomberung

@qq184861643 @boomberung Have you solved the resuming problem?

May 11 '19 15:05 LaviLiu

@DecentMakeover Yes, fine-tuning the model on a custom dataset is possible. However, I cannot get high accuracy when training on my custom dataset. Do you have any idea how to solve it?

Aug 12 '21 11:08 sangtv9
