faster-rcnn.pytorch
Runtime Error when resuming training
I was training with multiple GPUs on my own dataset, but when resuming training I got this error:
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
File "trainval_net.py", line 340, in <module>
optimizer.step()
File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3
Environment: PyTorch 0.4.0, CUDA 9.0, cuDNN 7.1.2, Python 3.5, GPUs: 4 x Tesla V100
Command line I used:
CUDA_VISIBLE_DEVICES=2,3,4,5 python trainval_net.py --dataset virtual_sign_2019 --net vgg16 --bs 32 --nw 16 --lr 0.001 --cuda --mGPUs --r True --checksession 1 --checkepoch 3 --checkpoint 1124
I have tried everything I can to solve this problem, including going through many related issues such as #515, #475, and #506, but the problem still exists. Is there any possible solution? Thanks.
Do you experience the same error when training on a single GPU, and then when resuming training on a single GPU?
Yes, when using a single GPU, the same RuntimeError occurs.
I guess it's caused by
optimizer.load_state_dict(checkpoint['optimizer'])
in trainval_net.py when resuming training, because the error message points to "optimizer.step()" every time. I tried commenting out these two lines in trainval_net.py:
https://github.com/jwyang/faster-rcnn.pytorch/blob/0797f6290e104e7d63cd487af759840d4a36985b/trainval_net.py#L283
# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']
which means the saved optimizer state won't be loaded when resuming training. This actually works: the RuntimeError never occurs again and training goes on. But I have no idea whether this is the right solution, whether it will affect the later training process, or what caused this problem in the first place.
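For anyone hitting the same error, one way to confirm whether the saved optimizer state really doesn't line up with the current parameters is to compare the shapes of the stored momentum buffers against the parameters the fresh optimizer was built over. This is only a diagnostic sketch, assuming the checkpoint layout that trainval_net.py writes (the 'optimizer' key holding an SGD state_dict) and that `optimizer` is the SGD instance built before the resume block runs:

import torch

# Diagnostic sketch (not part of the repo): compare the shapes of the
# momentum buffers stored in the checkpoint against the parameters of the
# freshly built optimizer. A mismatch here is what later surfaces as the
# RuntimeError inside optimizer.step().
ckpt_path = "models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth"
checkpoint = torch.load(ckpt_path, map_location="cpu")

saved_state = checkpoint['optimizer']['state']            # {param_id: {'momentum_buffer': tensor, ...}}
saved_groups = checkpoint['optimizer']['param_groups']    # per-group metadata, including the param ids

# Flatten both sides in group order; `optimizer` is the SGD instance
# constructed in trainval_net.py before resuming.
current_params = [p for group in optimizer.param_groups for p in group['params']]
saved_ids = [pid for group in saved_groups for pid in group['params']]

for pid, param in zip(saved_ids, current_params):
    buf = saved_state.get(pid, {}).get('momentum_buffer')
    if buf is not None and buf.shape != param.shape:
        print("mismatch at param id {}: saved {} vs current {}".format(
            pid, tuple(buf.shape), tuple(param.shape)))

If this prints anything, the checkpoint's optimizer state was built over a different parameter layout than the one the current run constructs, which matches the shape error in the traceback.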
I haven't been able to recreate your issue. Could you please send me the errors you get for 1 GPU and multiple GPUs?
Can you send me a snippet of the code you're using from
fasterRCNN.create_architecture()
till ...
if args.use_tfboard:
from tensorboardX import SummaryWriter
logger = SummaryWriter("logs")
Maybe I can spot an abnormality
Sure. The errors I got for 1 GPU and multiple GPUs are the same, and both are as described above:
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
File "trainval_net.py", line 340, in <module>
optimizer.step()
File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3
The code snippet I'm using is:
fasterRCNN.create_architecture()

lr = cfg.TRAIN.LEARNING_RATE
lr = args.lr
# tr_momentum = cfg.TRAIN.MOMENTUM
# tr_momentum = args.momentum

params = []
for key, value in dict(fasterRCNN.named_parameters()).items():
    if value.requires_grad:
        if 'bias' in key:
            params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1),
                        'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
        else:
            params += [{'params': [value], 'lr': lr, 'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]

if args.optimizer == "adam":
    lr = lr * 0.1
    optimizer = torch.optim.Adam(params)
elif args.optimizer == "sgd":
    optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

if args.cuda:
    fasterRCNN.cuda()

if args.resume:
    load_name = os.path.join(output_dir,
                             'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
    print("loading checkpoint %s" % (load_name))
    checkpoint = torch.load(load_name)
    args.session = checkpoint['session']
    args.start_epoch = checkpoint['epoch']
    fasterRCNN.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr = optimizer.param_groups[0]['lr']
    if 'pooling_mode' in checkpoint.keys():
        cfg.POOLING_MODE = checkpoint['pooling_mode']
    print("loaded checkpoint %s" % (load_name))

if args.mGPUs:
    fasterRCNN = nn.DataParallel(fasterRCNN)

iters_per_epoch = int(train_size / args.batch_size)

if args.use_tfboard:
    from tensorboardX import SummaryWriter
    logger = SummaryWriter("logs")
And by the way, the code actually wasn't modified; it is exactly the same as the master branch of this repo. Thanks~
Could you try and modify your code as suggested in this comment?
It has been merged into the pytorch-1.0 branch, but not the main branch. Maybe it will solve your problem as well.
Thanks... I've tried this: I switched to PyTorch 1.0 and the pytorch-1.0 branch, and then moved
if args.cuda: fasterRCNN.cuda()
above the assignment of the optimizer (which has already been done in the pytorch-1.0 branch), but when resuming training the problem still exists...
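For anyone following along, the reordering being discussed amounts to something like the sketch below. This is a paraphrase based on the description above, not a verbatim copy of the pytorch-1.0 branch, and it reuses the variable names from the snippet quoted earlier:

# Paraphrased ordering: move the model to the GPU *before* building the
# optimizer, so the optimizer's param groups (and any state later loaded
# into them) refer to the CUDA parameters rather than the CPU copies.
fasterRCNN.create_architecture()

if args.cuda:
    fasterRCNN.cuda()

lr = args.lr
params = []
for key, value in dict(fasterRCNN.named_parameters()).items():
    if value.requires_grad:
        if 'bias' in key:
            params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1),
                        'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
        else:
            params += [{'params': [value], 'lr': lr,
                        'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]

optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

if args.resume:
    checkpoint = torch.load(load_name)
    fasterRCNN.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr = optimizer.param_groups[0]['lr']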
And you still get the same error?
Everything should work pretty much out of the box; git pull and run. As a sanity check, have you tried simply running everything normally (not resuming, single GPU, etc.)? And then resuming with default parameters? Steadily working your way up to the full version of what you want to run.
EDIT: What version of torchvision are you using?
Yes... if I just use the out-of-the-box code, everything works just fine when training from scratch, either on single or multiple GPUs, but the error always occurs when resuming training...
But as I described before, if I comment out these two lines
# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']
in trainval_net.py when resuming training, the training process goes on normally. I've tested the modified code on my own dataset: the loss converged normally and the mAP I got on the test set was also acceptable.
I'm using the SGD optimizer, so it seems there isn't any adverse effect so far if I don't load the optimizer's state dict when resuming training. But it remains to be verified whether it has any negative effect on other optimizers like Adam.
I'm using torchvision 0.2.1 (build py35_1).
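In case it helps anyone using this workaround: if you want to skip optimizer.load_state_dict() but still continue from the learning rate that was in effect when the checkpoint was saved, you can read the lr out of the saved state without restoring the momentum buffers. A small sketch of what I mean, assuming the checkpoint layout trainval_net.py writes and that the first saved param group carries the base (non-bias) learning rate:

if args.resume:
    checkpoint = torch.load(load_name)
    args.session = checkpoint['session']
    args.start_epoch = checkpoint['epoch']
    fasterRCNN.load_state_dict(checkpoint['model'])

    # Workaround from this thread: do NOT restore the optimizer state, so no
    # stale momentum buffers with mismatched shapes are loaded.
    # optimizer.load_state_dict(checkpoint['optimizer'])

    # Still pick up the learning rate that was active when the checkpoint was
    # written, so lr decay continues from the right point. Group 0 is assumed
    # to be a non-bias group, i.e. it holds the base lr.
    saved_lr = checkpoint['optimizer']['param_groups'][0]['lr']
    scale = saved_lr / optimizer.param_groups[0]['lr']
    for group in optimizer.param_groups:
        group['lr'] *= scale        # keeps the bias-lr doubling intact
    lr = optimizer.param_groups[0]['lr']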
You could try to update your torchvision version; I read on a different repo that it might help.
But if, as you said, you have no negative side effects yet when not loading the optimizer's state dict, might as well resume the way you are currently doing. Sorry I couldn't help you fix the problem.
Thanks a lot~ I'll try a higher torchvision version.
@HViktorTsoi I'm getting the same error as you. Have you solved it?
Yes, I solved the problem with this: https://github.com/jwyang/faster-rcnn.pytorch/issues/521#issuecomment-489911088 It doesn't seem to have any side effects after long-term use.
Hi, I am having the same problem.
I am trying to load a model that was trained on pytorch==1.2.0.
When I load the model in pytorch==1.6.0 and resume training, the training gets corrupted right after optimizer.step() is called.
Would loading an optimizer state that was saved with a different version be an issue?
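One way to at least make that kind of cross-version mismatch visible is to record the framework version when the checkpoint is written and compare it on resume. This is only a suggestion sketched against the names used in this thread; the 'torch_version' key is an addition you would have to make yourself, it is not something the repo writes today, and the path is a placeholder:

import warnings
import torch

# Hypothetical checkpoint path; in trainval_net.py the save/load paths are
# built from args.checksession/checkepoch/checkpoint.
ckpt_path = "models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth"

# When saving: record the framework version alongside the usual states.
torch.save({
    'model': fasterRCNN.state_dict(),
    'optimizer': optimizer.state_dict(),
    'torch_version': torch.__version__,   # extra key, not written by the repo today
}, ckpt_path)

# When resuming: warn if the checkpoint came from a different version, and
# consider skipping optimizer.load_state_dict() in that case.
checkpoint = torch.load(ckpt_path)
saved_version = checkpoint.get('torch_version', 'unknown')
if saved_version != torch.__version__:
    warnings.warn("checkpoint written with torch %s, resuming with %s"
                  % (saved_version, torch.__version__))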
Thanks! I've encountered the same issue and this solution works for me.