torchcv

Key mismatch while loading the model?

Open · Jumabek opened this issue 6 years ago · 17 comments

I am having an issue loading the trained checkpoint into the FPNSSD512 model. How can I fix it?

RuntimeError: Error(s) in loading state_dict for FPNSSD512:
	Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.running_var", "fpn.bn1.bias", "fpn.bn1.running_mean", "fpn.bn1.weight", "fpn.layer1.0.conv1.weight", "fpn.layer1.0.bn1.running_var", "fpn.layer1.0.bn1.bias", "fpn.layer1.0.bn1.running_mean", "fpn.layer1.0.bn1.weight", "fpn.layer1.0.conv2.weight", "fpn.layer1.0.bn2.running_var", "fpn.layer1.0.bn2.bias", 

        Unexpected key(s) in state_dict: "module.fpn.conv1.weight", "module.fpn.bn1.weight", "module.fpn.bn1.bias", "module.fpn.bn1.running_mean", "module.fpn.bn1.running_var", "module.fpn.layer1.0.conv1.weight"
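The two key lists above seem to differ only by a leading "module.". A quick way to confirm that kind of mismatch is to diff the two key sets and check whether every unexpected key is a missing key with one extra prefix. The sketch below is plain Python and works on any collection of parameter names (for example net.state_dict().keys() and the keys of the dict returned by torch.load(path, map_location="cpu"), assuming the file holds a plain state_dict as the snippets in this thread imply). The helper name diff_keys is hypothetical, not part of torchcv or PyTorch.

```python
def diff_keys(model_keys, ckpt_keys):
    """Return (missing, unexpected, prefix) for two collections of state_dict keys.

    `prefix` is the string that, prepended to every missing key, yields exactly
    the unexpected keys; it is None when no such single prefix exists.
    """
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)      # expected by the model, absent from the file
    unexpected = sorted(ckpt_keys - model_keys)   # present in the file, unknown to the model

    prefix = None
    if missing and len(missing) == len(unexpected):
        first = unexpected[0]
        # Every way `first` could be written as "<prefix><some missing key>".
        candidates = sorted({first[:-len(m)] for m in missing
                             if first.endswith(m) and len(first) > len(m)})
        for cand in candidates:
            if {cand + m for m in missing} == set(unexpected):
                prefix = cand
                break
    return missing, unexpected, prefix


# Key names copied from the error message above (truncated list).
model = ["fpn.conv1.weight", "fpn.bn1.weight", "fpn.bn1.bias"]
ckpt = ["module.fpn.conv1.weight", "module.fpn.bn1.weight", "module.fpn.bn1.bias"]
missing, unexpected, prefix = diff_keys(model, ckpt)
print(prefix)  # module.
```

When the names differ in some other way (for example a different top-level module name), prefix comes back as None, which suggests the checkpoint and the model are not structured the same way.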

Jumabek, May 31 '18 07:05

Adding the following code before loading the checkpoint solved the issue:

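import torch
import torch.backends.cudnn as cudnn
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device == 'cuda':
    net = torch.nn.DataParallel(net)  # wrapped net's keys gain a "module." prefix, like the checkpoint's
    cudnn.benchmark = True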

Jumabek, May 31 '18 08:05

Dear @Jumabek, I have also your reported issue. my script is something like this:

import torch
import torch.backends.cudnn as cudnn
from models.fpnssd.net import FPNSSD512


# Print the PyTorch Version:
print(torch.__version__)  # 0.4.0


# *************** Parameters **************** #
# Use the GPU if one is available
use_gpu = torch.cuda.is_available()  # use GPU
if use_gpu:
    device = torch.device("cuda:0")  
else:
    device = torch.device("cpu")


# ** Loading Pre-Trained Weights:
net = FPNSSD512(num_classes=20).to(device)
net = torch.nn.DataParallel(net)
cudnn.benchmark = True
# download pre-trained weights from:
# https://drive.google.com/open?id=1yy_kUnm_hZR3uk9yLcaQSMwxVn7wApTU
net.load_state_dict(torch.load('./fpnssd512_20_trained.pth'))
net.eval()

However, I get the same error you reported. Could you please help me resolve it?

ahkarami, Jun 23 '18 10:06

Dear @kuangliu, could you please answer my question above?

ahkarami, Jun 26 '18 05:06

@ahkarami, sorry for the late reply. I do not fully understand the issue, but can you run the code below? I added net = torch.nn.DataParallel(net) after loading the model.

import torch
import torch.backends.cudnn as cudnn
from models.fpnssd.net import FPNSSD512


# Print the PyTorch Version:
print(torch.__version__)  # 0.4.0


# *************** Parameters **************** #
# Use the GPU if one is available
use_gpu = torch.cuda.is_available()  # use GPU
if use_gpu:
    device = torch.device("cuda:0")  
else:
    device = torch.device("cpu")


# ** Loading Pre-Trained Weights:
net = FPNSSD512(num_classes=20).to(device)
cudnn.benchmark = True
# download pre-trained weights from:
# https://drive.google.com/open?id=1yy_kUnm_hZR3uk9yLcaQSMwxVn7wApTU
net.load_state_dict(torch.load('./fpnssd512_20_trained.pth'))
net = torch.nn.DataParallel(net)
net.eval()

Jumabek, Jul 03 '18 13:07

Dear @Jumabek, thank you for your reply, and sorry for the trouble. I have tested your suggested script, but unfortunately the error remains. The error is:

Traceback (most recent call last):
  File "/home/user/TorchCV/Attempt1.py", line 54, in <module>
    net.load_state_dict(torch.load('./fpnssd512_20_trained.pth'))
  File "/opt/pytorch4/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for FPNSSD512:
	Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.running_mean", "fpn.bn1.running_var", ...
	Unexpected key(s) in state_dict: "extractor.conv1.weight", "extractor.bn1.weight", "extractor.bn1.bias", ....

Process finished with exit code 1

It is worth noting that I tested the above code on a system with a single GTX 1080 Ti GPU (CUDA 9.0 and cuDNN 7).
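This traceback differs from the first one in the issue: the file's keys start with "extractor." while the model expects "fpn.", so wrapping in DataParallel (which only adds "module.") cannot reconcile them. Grouping keys by their first dotted component makes this kind of naming difference visible at a glance. Below is a minimal plain-Python sketch; top_level_counts is a hypothetical helper, and the example key lists are the truncated ones from the traceback above, not the full checkpoint.

```python
from collections import Counter


def top_level_counts(keys):
    """Count state_dict keys by their first dotted component (e.g. 'fpn', 'module')."""
    return Counter(k.split(".", 1)[0] for k in keys)


# Truncated key lists copied from the traceback above.
model_keys = ["fpn.conv1.weight", "fpn.bn1.running_mean", "fpn.bn1.running_var"]
ckpt_keys = ["extractor.conv1.weight", "extractor.bn1.weight", "extractor.bn1.bias"]

print(top_level_counts(model_keys))  # Counter({'fpn': 3})
print(top_level_counts(ckpt_keys))   # Counter({'extractor': 3})
```

With PyTorch, iterating a state_dict yields its keys, so top_level_counts(net.state_dict()) and top_level_counts(torch.load(path, map_location="cpu")) can be compared directly. Different top-level names (rather than one side carrying an extra "module.") point to a checkpoint saved from a differently structured model.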

ahkarami, Jul 04 '18 06:07

Hi, I followed your code and it seems to have solved the unexpected-key issue for me. But I'm wondering why it occurs in the first place. Why does DataParallel help?
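In PyTorch, state_dict() names each parameter and buffer by the dotted path of attribute names leading to it, and torch.nn.DataParallel keeps the wrapped network in an attribute called module. Wrapping therefore prefixes every key with "module.", which is why a checkpoint saved from a wrapped model loads as-is (with the default strict=True) only into a wrapped model. The sketch below is a toy stand-in for that naming rule using nested dicts. It does not use PyTorch, and dotted_names is a hypothetical helper meant only to make the prefixing visible.

```python
def dotted_names(tree, prefix=""):
    """Flatten a nested dict of 'submodules' into dotted names,
    mimicking how nn.Module.state_dict() builds its keys."""
    names = []
    for name, child in tree.items():
        full = prefix + name
        if isinstance(child, dict):   # a submodule: recurse, extending the path with "name."
            names.extend(dotted_names(child, full + "."))
        else:                         # a leaf (stands in for a parameter or buffer)
            names.append(full)
    return names


net = {"fpn": {"conv1": {"weight": None}, "bn1": {"weight": None, "bias": None}}}
print(dotted_names(net))
# ['fpn.conv1.weight', 'fpn.bn1.weight', 'fpn.bn1.bias']

wrapped = {"module": net}  # toy view of DataParallel(net): the net lives under `.module`
print(dotted_names(wrapped))
# ['module.fpn.conv1.weight', 'module.fpn.bn1.weight', 'module.fpn.bn1.bias']
```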

dearleiii, Jul 20 '18 18:07

@ahkarami I'm running into the same issue. Have you fixed it yet?

zacario-li, Aug 14 '18 03:08

Dear @zacario-li, unfortunately I couldn't resolve the issue. I can train and test the model on my own GPU (i.e., models I train myself load correctly), but the released pre-trained model has the issue above. I think the problem is related to the fact that the pre-trained model was trained on a machine with multiple GPUs, while we want to use it on a machine with just one GPU. In that case, however, wrapping the model with torch.nn.DataParallel(net) should fix the problem, yet as we saw, it doesn't!

ahkarami, Aug 14 '18 06:08

If you want to load the weights after wrapping with DataParallel, use: net.module.load_state_dict(pretrained_weights)
If you want to load the weights before wrapping with DataParallel, use: net.load_state_dict(pretrained_weights)
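Both options in the comment above make the names line up by choosing which object receives the weights. Another common approach, not mentioned in this thread, is to rewrite the checkpoint's keys instead: strip "module." when loading a multi-GPU checkpoint into a plain model, or add it when the model is wrapped. Below is a plain-Python sketch; match_module_prefix is a hypothetical helper, and it assumes the checkpoint is a flat {name: tensor} dict like the one torch.load returns in the snippets above.

```python
from collections import OrderedDict

PREFIX = "module."


def match_module_prefix(state_dict, target_keys):
    """Return a copy of `state_dict` whose keys carry the 'module.' prefix
    exactly when `target_keys` (e.g. net.state_dict().keys()) do."""
    target_wrapped = any(k.startswith(PREFIX) for k in target_keys)
    fixed = OrderedDict()
    for key, value in state_dict.items():
        has_prefix = key.startswith(PREFIX)
        if has_prefix and not target_wrapped:
            key = key[len(PREFIX):]   # multi-GPU checkpoint -> plain model
        elif not has_prefix and target_wrapped:
            key = PREFIX + key        # plain checkpoint -> DataParallel model
        fixed[key] = value
    return fixed


ckpt = OrderedDict([("module.fpn.conv1.weight", 1), ("module.fpn.bn1.bias", 2)])
print(list(match_module_prefix(ckpt, ["fpn.conv1.weight", "fpn.bn1.bias"])))
# ['fpn.conv1.weight', 'fpn.bn1.bias']
```

With PyTorch this could be used as net.load_state_dict(match_module_prefix(torch.load(path, map_location="cpu"), net.state_dict().keys())). It only addresses the "module." difference; it cannot fix a mismatch such as "extractor." versus "fpn.", which comes from a different model definition.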

root-master, Oct 10 '18 05:10

Dear @ahkarami, I think the pretrained fpnssd model provided by @kuangliu does not match /models/fpnssd/net.py. He said that he just replaced VGG16 with FPN50 in SSD512, which is /models/ssd/net.py. So you cannot load weights saved from the /models/ssd/net.py model into the model created by /models/fpnssd/net.py, because the keys do not match. The way to use his pretrained model is to use his SSD512 model with FPN50, not the FPNSSD512 model in /models/fpnssd/net.py. Also, it seems that he did not put all of his examples on GitHub, or he deleted something before pushing.

silkylove, Oct 17 '18 15:10

Dear @silkylove, thank you very much for the useful information. Were you able to load and use his pre-trained network? If so, could you please share the loading code?

ahkarami, Oct 17 '18 18:10

Dear @ahkarami, OK, I will release the code once I get performance similar to his pretrained fpnssd512 model.

silkylove, Oct 18 '18 03:10

Thank you very much @silkylove.

ahkarami, Oct 18 '18 05:10

@ahkarami Please check my code: https://github.com/silkylove/ObjectDetection/tree/master/example/fpnssd I also uploaded the training log for Adam with 100 epochs, which reaches 73.95 mAP so far. I am now training with SGD for 200 epochs, which I think will reach a higher mAP; I will release that training log later. Also, you can uncomment this line in eval.py https://github.com/silkylove/ObjectDetection/blob/master/example/fpnssd/eval.py#L25 to get his pretrained model's performance (about 56 mAP). And make sure not to use DataParallel.

silkylove, Oct 20 '18 03:10

Dear @silkylove, thank you very much for your time. Your modified implementation is really valuable. It would also be great if you could upload your pre-trained model (e.g., to Google Drive).

ahkarami, Oct 20 '18 16:10

@ahkarami I uploaded the SGD training and evaluation logs. With SGD, I only get around 76% mAP so far. The pretrained model is here.

silkylove, Oct 21 '18 03:10

@silkylove, Thank you very much.

ahkarami, Oct 21 '18 19:10