faster-rcnn.pytorch

Runtime Error when resuming training

Open HViktorTsoi opened this issue 6 years ago • 14 comments

I was training with multiple GPUs on my own dataset, but when resuming training I got this error:

Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
  File "trainval_net.py", line 340, in <module>
    optimizer.step()
  File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3

Environment: Pytorch 0.4.0 CUDA 9.0 cuDNN 7.1.2 Python 3.5 GPUs: 4 x Tesla V100

Command line I used:

CUDA_VISIBLE_DEVICES=2,3,4,5 python trainval_net.py --dataset virtual_sign_2019 --net vgg16 --bs 32 --nw 16 --lr 0.001 --cuda --mGPUs --r True --checksession 1 --checkepoch 3 --checkpoint 1124

I have tried everything I can to solve this problem, including the related issues #515, #475, and #506, but it still occurs. Is there any possible solution? Thanks.

HViktorTsoi avatar Apr 25 '19 10:04 HViktorTsoi

Do you experience the same error when training on a single GPU, and then when resuming training on a single GPU?

AlexanderHustinx avatar Apr 26 '19 09:04 AlexanderHustinx

Do you experience the same error when training on a single GPU, and then when resuming training on a single GPU?

Yes, the same RuntimeError occurs when using a single GPU. I guess it's caused by optimizer.load_state_dict(checkpoint['optimizer']) in trainval_net.py when resuming training, because the error message points to optimizer.step() every time. I tried commenting out these two lines in trainval_net.py: https://github.com/jwyang/faster-rcnn.pytorch/blob/0797f6290e104e7d63cd487af759840d4a36985b/trainval_net.py#L283

# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']

which means the saved optimizer state won't be loaded when resuming training. This actually works: the RuntimeError never occurs again and the training goes on. But I have no idea whether this is the right solution, whether it will affect the later training process, or what caused the problem in the first place.
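
If skipping the optimizer state also loses the decayed learning rate, one option might be to copy only the saved learning rates back into the fresh optimizer while still dropping the momentum buffers. This is just a sketch of the idea, assuming checkpoint['optimizer'] is a standard PyTorch optimizer state dict:

# Sketch (untested assumption): restore only the learning rates from the
# checkpointed optimizer state, skipping the momentum buffers that trigger
# the shape mismatch in optimizer.step().
saved_groups = checkpoint['optimizer']['param_groups']
if len(saved_groups) == len(optimizer.param_groups):
    for group, saved in zip(optimizer.param_groups, saved_groups):
        group['lr'] = saved['lr']
    lr = optimizer.param_groups[0]['lr']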

HViktorTsoi avatar Apr 29 '19 07:04 HViktorTsoi

I haven't been able to recreate your issue. Could you please send me the errors you get for 1 GPU and for multiple GPUs? Can you send me a snippet of the code you're using, from fasterRCNN.create_architecture() up to ...

  if args.use_tfboard:
    from tensorboardX import SummaryWriter
    logger = SummaryWriter("logs")

Maybe I can spot an abnormality.

AlexanderHustinx avatar May 03 '19 10:05 AlexanderHustinx

Sure. The errors I get for 1 GPU and for multiple GPUs are the same, as in my description above:


Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
  File "trainval_net.py", line 340, in <module>
    optimizer.step()
  File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3

The code snippet I'm using is:

    fasterRCNN.create_architecture()

    lr = cfg.TRAIN.LEARNING_RATE
    lr = args.lr
    # tr_momentum = cfg.TRAIN.MOMENTUM
    # tr_momentum = args.momentum

    params = []
    for key, value in dict(fasterRCNN.named_parameters()).items():
        if value.requires_grad:
            if 'bias' in key:
                params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1), \
                            'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
            else:
                params += [{'params': [value], 'lr': lr, 'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]

    if args.optimizer == "adam":
        lr = lr * 0.1
        optimizer = torch.optim.Adam(params)

    elif args.optimizer == "sgd":
        optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

    if args.cuda:
        fasterRCNN.cuda()

    if args.resume:
        load_name = os.path.join(output_dir,
                                 'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
        print("loading checkpoint %s" % (load_name))
        checkpoint = torch.load(load_name)
        args.session = checkpoint['session']
        args.start_epoch = checkpoint['epoch']
        fasterRCNN.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        lr = optimizer.param_groups[0]['lr']
        if 'pooling_mode' in checkpoint.keys():
            cfg.POOLING_MODE = checkpoint['pooling_mode']
        print("loaded checkpoint %s" % (load_name))

    if args.mGPUs:
        fasterRCNN = nn.DataParallel(fasterRCNN)

    iters_per_epoch = int(train_size / args.batch_size)

    if args.use_tfboard:
        from tensorboardX import SummaryWriter

        logger = SummaryWriter("logs")

By the way, the code wasn't modified at all; it's exactly the same as the master branch of this repo. Thanks~
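
One thing I'm wondering about (just a guess on my side): params is built from dict(fasterRCNN.named_parameters()).items(), and on Python 3.5 a plain dict does not guarantee the same iteration order across runs, because string hashing is randomized per process. If the param groups come out in a different order than when the checkpoint was saved, the index-based optimizer state would be matched against parameters of different shapes, which looks a lot like this error. A minimal order-preserving variant would iterate named_parameters() directly:

# Sketch (assumption, not the repo's confirmed fix): build the param groups
# by iterating named_parameters() directly, which follows the deterministic
# module registration order instead of an unordered Python 3.5 dict.
params = []
for key, value in fasterRCNN.named_parameters():
    if not value.requires_grad:
        continue
    if 'bias' in key:
        params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1),
                    'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
    else:
        params += [{'params': [value], 'lr': lr, 'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]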

HViktorTsoi avatar May 04 '19 09:05 HViktorTsoi

Could you try and modify your code as suggested in this comment

It has been merged into the pytorch-1.0 branch, but not the main branch. Maybe it will solve your problem as well.

AlexanderHustinx avatar May 06 '19 08:05 AlexanderHustinx

Could you try and modify your code as suggested in this comment

It has been merged into the pytorch-1.0 branch, but not the main branch. Maybe it will solve your problem as well.

Thanks... I've tried that: I switched to PyTorch 1.0 and used the pytorch-1.0 branch, and moved if args.cuda: fasterRCNN.cuda() above the optimizer assignment (which is already done in the pytorch-1.0 branch), but when resuming training the problem still exists...

HViktorTsoi avatar May 06 '19 12:05 HViktorTsoi

And you still get the same error?

Everything should work pretty much out-of-the-box; git pull and run. As a sanity check, have you tried to simply run everything normally (not resuming, single GPU, etc.)? And then resuming with default parameters? Steadily working your way up to the full version of what you want to run.

EDIT: What version of torchvision are you using?

AlexanderHustinx avatar May 06 '19 13:05 AlexanderHustinx

Yes... if I just use the out-of-the-box code, everything works just fine when training from scratch, on either a single GPU or multiple GPUs, but the error always occurs when resuming training...

But as I described before, if I comment out these two lines

# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']

in trainval_net.py when resuming training, the training process continues normally. I've tested the modified code on my own dataset: the loss converged normally and the mAP on the test set was also acceptable.

I'm using the SGD optimizer, and so far there doesn't seem to be any adverse effect from not loading its state dict when resuming training. But it remains to be verified whether this has any negative effect with other optimizers like Adam.
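
For what it's worth, a quick way to see exactly what gets discarded by skipping load_state_dict (just a sketch, assuming the checkpoint stores a standard optimizer state dict) is to look at the per-parameter state keys, e.g. 'momentum_buffer' for SGD or 'exp_avg'/'exp_avg_sq' for Adam:

# Sketch: list the per-parameter state entries carried by the checkpointed
# optimizer, i.e. what is thrown away when its state dict is not loaded.
discarded = set()
for per_param_state in checkpoint['optimizer']['state'].values():
    discarded.update(per_param_state.keys())
print('discarded optimizer state keys:', discarded)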

I'm using torchvision 0.2.1 (build py35_1).

And you still get the same error?

Everything should work pretty much out-of-the-box; git pull and run. As a sanity check, have you tried to simply run everything normally (not resuming, single GPU, etc.)? And then resuming with default parameters? Steadily working your way up to the full version of what you want to run.

EDIT: What version of torchvision are you using?

HViktorTsoi avatar May 07 '19 05:05 HViktorTsoi

You could try updating your torchvision version; I read on a different repo that it might help.

But if, as you said, you see no negative side effects from not loading the optimizer's state dict, you might as well resume the way you are currently doing. Sorry I couldn't help you fix the problem.
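
One more thing you could try, although it's only a guess and I haven't verified it against this repo: after optimizer.load_state_dict(checkpoint['optimizer']), push the loaded state tensors onto the GPU so they end up on the same device as the parameters they update. Something like:

# Sketch (unverified guess): move any loaded optimizer state tensors
# (e.g. SGD momentum buffers) to the GPU after load_state_dict(), so they
# live on the same device as the parameters they update.
if args.cuda:
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.cuda()

That addresses device placement rather than the shape mismatch itself, so it may well not be the cause here.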

AlexanderHustinx avatar May 07 '19 08:05 AlexanderHustinx

You could try updating your torchvision version; I read on a different repo that it might help.

But if, as you said, you see no negative side effects from not loading the optimizer's state dict, you might as well resume the way you are currently doing. Sorry I couldn't help you fix the problem.

Thanks a lot~ I'll try a newer torchvision version.

HViktorTsoi avatar May 07 '19 17:05 HViktorTsoi

@HViktorTsoi I'm running into the same error as you. Have you solved it?

H-YunHui avatar Sep 14 '19 01:09 H-YunHui

@HViktorTsoi I'm running into the same error as you. Have you solved it?

Yes, I solved the problem with this: https://github.com/jwyang/faster-rcnn.pytorch/issues/521#issuecomment-489911088. It doesn't seem to have any side effects even after long-term use.

HViktorTsoi avatar Sep 17 '19 09:09 HViktorTsoi

Hi, I am having the same problem.

I am trying to load a model that was trained on pytorch==1.2.0.

When I load the model in pytorch==1.6.0 and resume training, the training gets corrupted right after optimizer.step() is called.

Could loading an optimizer state that was saved under a different version be an issue?
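
As a sanity check before the first optimizer.step(), I'm thinking of comparing every loaded per-parameter state tensor against the shape of the parameter it is attached to (just a sketch, assuming a standard optimizer state dict with keys like 'momentum_buffer' or 'exp_avg'):

import torch

# Sketch: flag any loaded optimizer state tensor whose shape does not match
# the parameter it will be applied to; a mismatch here would explain an
# update that corrupts training on the first optimizer.step().
def check_optimizer_state(optimizer):
    for group in optimizer.param_groups:
        for p in group['params']:
            for name, v in optimizer.state.get(p, {}).items():
                if torch.is_tensor(v) and v.dim() > 0 and v.shape != p.shape:
                    print('state %r shape %s does not match param shape %s'
                          % (name, tuple(v.shape), tuple(p.shape)))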

YangJae96 avatar Nov 30 '21 04:11 YangJae96

Thanks! I've encountered the same issue and this solution works for me.

Yes... if I just use the out-of-the-box code, everything works just fine when training from scratch, on either a single GPU or multiple GPUs, but the error always occurs when resuming training...

But as I described before, if I comment out these two lines

# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']

in trainval_net.py when resuming training, the training process continues normally. I've tested the modified code on my own dataset: the loss converged normally and the mAP on the test set was also acceptable.

I'm using the SGD optimizer, and so far there doesn't seem to be any adverse effect from not loading its state dict when resuming training. But it remains to be verified whether this has any negative effect with other optimizers like Adam.

I'm using torchvision 0.2.1 (build py35_1).

And you still get the same error? Everything should work pretty much out-of-the-box; git pull and run. As a sanity check, have you tried to simply run everything normally (not resuming, single GPU, etc.)? And then resuming with default parameters? Steadily working your way up to the full version of what you want to run. EDIT: What version of torchvision are you using?

syr-cn avatar Jun 27 '22 11:06 syr-cn