
Error when using default weights "droid.pth" as pretrained weights

YznMur opened this issue 2 years ago · 14 comments

Hi @zachteed @xhangHU, I couldn't use your weights "droid.pth" for training. I got this error:

Traceback (most recent call last):
  File "train.py", line 189, in <module>
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/trainer/droidslam/train.py", line 60, in train
    model.load_state_dict(torch.load(args.ckpt))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
        size mismatch for module.update.weight.2.weight: copying a param with shape torch.Size([3, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([2, 128, 3, 3]).
        size mismatch for module.update.weight.2.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
        size mismatch for module.update.delta.2.weight: copying a param with shape torch.Size([3, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([2, 128, 3, 3]).
        size mismatch for module.update.delta.2.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]). 

I am trying to train the model on KITTI. These are the parameters I am using:

    clip=2.5, edges=24, fmax=96.0, fmin=8.0, gpus=4, iters=15, lr=5e-05, n_frames=7, noise=False, restart_prob=0.2, scale=False, steps=250000, w1=10.0, w2=0.01, w3=0.05, world_size=4
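
For anyone hitting the same error, here is a minimal sketch that lists every parameter whose shape differs between the checkpoint and the model (it assumes `model` is the DDP-wrapped network that train.py constructs, so its keys carry the same `module.` prefix as droid.pth):

```python
import torch

# Sketch (not from the repo): compare parameter shapes between droid.pth
# and the freshly constructed model to see exactly which layers disagree.
ckpt = torch.load("droid.pth", map_location="cpu")
state = model.state_dict()  # assumes `model` is the DDP-wrapped network

for name, param in ckpt.items():
    if name in state and state[name].shape != param.shape:
        print(f"{name}: checkpoint {tuple(param.shape)} vs model {tuple(state[name].shape)}")
```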

YznMur · Jun 13 '22, 16:06

I figured it out by making some changes in `class UpdateModule(nn.Module)`:

        self.weight = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 3, 3, padding=1),  # out channels: 2 -> 3, to match droid.pth
            GradientClip(),
            nn.Sigmoid())

and similarly:

        self.delta = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 3, 3, padding=1),  # out channels: 2 -> 3, to match droid.pth
            GradientClip())
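
A hedged alternative, for anyone who would rather keep the repo's original 2-channel heads: drop the mismatched entries from the checkpoint before loading, leaving those two layers randomly initialized. Note that `strict=False` would not help here, since `load_state_dict` reports shape mismatches regardless of `strict`. The sketch below would stand in for the `model.load_state_dict(torch.load(args.ckpt))` call in train.py:

```python
import torch

# Sketch: load droid.pth but skip any entry whose shape disagrees with the
# current model, so the 2-channel weight/delta heads stay randomly
# initialized instead of crashing the load.
ckpt = torch.load(args.ckpt, map_location="cpu")
state = model.state_dict()

matched = {k: v for k, v in ckpt.items()
           if k in state and v.shape == state[k].shape}
print("skipped keys:", sorted(set(ckpt) - set(matched)))

state.update(matched)
model.load_state_dict(state)
```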

If you have any advice about training the model on KITTI, or about the training config, it would be appreciated!

YznMur · Jun 13 '22, 20:06

Why do we need to change the model shape for training vs inference?

felipesce · Oct 26 '22, 18:10