RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824
I tried to fine-tune the segmentation model using the pretrained Vim-T, but encountered the following issue while executing bash scripts/ft_vim_tiny_upernet.sh:
```
Position interpolate from 14x14 to 32x32
Traceback (most recent call last):
  File "/home/vic1113/miniconda3/envs/vim_seg/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 89, in __init__
    self.init_weights(pretrained)
  File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 143, in init_weights
    interpolate_pos_embed(self, state_dict_model)
  File "/home/vic1113/PrMamba/vim/utils.py", line 258, in interpolate_pos_embed
    pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
RuntimeError: shape '[-1, 14, 14, 192]' is invalid for input of size 37824
```
This error is propagated through multiple functions, resulting in the final error:
RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824.
The pretrained weight I used was vim_t_midclstok_76p1acc.pth, which seems to be the correct one. If it were the wrong checkpoint, I would expect a loading error such as `size mismatch for norm_f.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([384])`, but I didn't get anything like that.
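For reference, something like the snippet below should show exactly which positional-embedding shape is stored in the file; the 'model' key is a guess based on how DeiT-style classification checkpoints are usually saved, so it may need adjusting.

```python
import torch

ckpt = torch.load('vim_t_midclstok_76p1acc.pth', map_location='cpu')
# Classification checkpoints often wrap the weights in a 'model' key;
# fall back to the top level if that key is absent.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt

pe = state_dict['pos_embed']
print(pe.shape)    # expecting torch.Size([1, 197, 192]): 14*14 patches + 1 extra token
print(pe.numel())  # 197 * 192 = 37824, the size reported in the error
```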
So I guess there might be an issue with the model settings, but I'm not sure. 37824 = (14*14 + 1) * 192, and the "+1" is the part that leads to the error. If that "+1" is the mid cls token, should I just drop it for the segmentation model?
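To make the question concrete, here is a rough sketch of what I mean by dropping the extra token before the interpolation step. It is only a guess at how the logic in vim/utils.py could be adapted: I am assuming the mid cls token sits at the middle of the 197-token sequence (which is what the "midclstok" name suggests), and the function below is made up for illustration rather than taken from the repo.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed_drop_mid_cls(pos_embed_ckpt, new_size):
    # pos_embed_ckpt: [1, 14*14 + 1, 192] from the classification checkpoint.
    embedding_size = pos_embed_ckpt.shape[-1]
    num_tokens = pos_embed_ckpt.shape[1]          # 197 for Vim-T midclstok
    orig_size = int((num_tokens - 1) ** 0.5)      # 14, assuming one extra token
    if num_tokens == orig_size * orig_size + 1:
        # Assumption: the "midclstok" checkpoints keep the cls token in the
        # middle of the sequence, so drop that row rather than the first one.
        mid = num_tokens // 2
        pos_embed_ckpt = torch.cat(
            [pos_embed_ckpt[:, :mid], pos_embed_ckpt[:, mid + 1:]], dim=1)
    # The reshape from the traceback now sees exactly orig_size * orig_size rows.
    pos_tokens = pos_embed_ckpt.reshape(-1, orig_size, orig_size, embedding_size)
    pos_tokens = pos_tokens.permute(0, 3, 1, 2)
    pos_tokens = F.interpolate(
        pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
    # Back to [1, new_size * new_size, embedding_size] for the segmentation model.
    return pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
```

Whether dropping that row is actually correct, or whether the segmentation backbone is supposed to keep the mid cls token, is exactly what I'm unsure about.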
Has anyone encountered this problem, or successfully fine-tuned the segmentation model?
Thank you very much!
Same issue, have you fixed it yet?
No, I can't apply the pretrained weights to the segmentation model. It seems the shapes of the backbones are different, and we might need to retrain it.
Is VisionMambaSeg a pretrained model? I loaded the pretrained Vim-small+ (26M, 81.6 / 95.4) weights from https://huggingface.co/hustvl/Vim-small-midclstok, but there are many mismatches, for example the checkpoint['meta'] entry that mmcv looks for is absent from the downloaded model.
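If the only structural complaint is the missing checkpoint['meta'] entry, maybe re-saving the downloaded file in the layout that mmcv's save_checkpoint normally produces would get past that part. A rough sketch of what I have in mind (the paths and the 'model' key are guesses, not the actual layout of the Hugging Face file):

```python
import torch

# Placeholder paths: point src at the downloaded Vim-small checkpoint.
src = 'vim_small_midclstok.pth'
dst = 'vim_small_midclstok_for_mmcv.pth'

ckpt = torch.load(src, map_location='cpu')
# Classification checkpoints often keep the weights under a 'model' key;
# fall back to the top level if that key is absent.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt

# Re-save in the {'meta': ..., 'state_dict': ...} layout that mmcv's
# checkpoint utilities produce, with an empty meta dict.
torch.save({'meta': {}, 'state_dict': state_dict}, dst)
```

This would only address the missing 'meta' entry, though, not the parameter-name or shape mismatches themselves.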