RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824
I tried to fine-tune the segmentation model using the pretrained Vim-T, but encountered the following issue while executing bash scripts/ft_vim_tiny_upernet.sh:
```
Position interpolate from 14x14 to 32x32
Traceback (most recent call last):
  File "/home/vic1113/miniconda3/envs/vim_seg/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 89, in __init__
    self.init_weights(pretrained)
  File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 143, in init_weights
    interpolate_pos_embed(self, state_dict_model)
  File "/home/vic1113/PrMamba/vim/utils.py", line 258, in interpolate_pos_embed
    pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
RuntimeError: shape '[-1, 14, 14, 192]' is invalid for input of size 37824
```
This error is propagated through multiple functions, resulting in the final error:
RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824.
The pretrained weight I used was vim_t_midclstok_76p1acc.pth, which seems to be the correct one. If it were the wrong checkpoint, I would expect a loading error such as `size mismatch for norm_f.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([384])`, but I didn't get anything like that.
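For reference, something like the snippet below should show exactly which positional-embedding shape is stored in the file; the 'model' key is a guess based on how DeiT-style classification checkpoints are usually saved, so it may need adjusting.

```python
import torch

ckpt = torch.load('vim_t_midclstok_76p1acc.pth', map_location='cpu')
# Classification checkpoints often wrap the weights in a 'model' key;
# fall back to the top level if that key is absent.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt

pe = state_dict['pos_embed']
print(pe.shape)    # expecting torch.Size([1, 197, 192]): 14*14 patches + 1 extra token
print(pe.numel())  # 197 * 192 = 37824, the size reported in the error
```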
So I guess there might be an issue with the model settings, but I'm not sure. 37824 = (14*14 + 1) * 192, and the "+1" is the part that leads to the error. If that "+1" is the mid cls token, should I just drop it for the segmentation model?
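To make the question concrete, here is a rough sketch of what I mean by dropping the extra token before the interpolation step. It is only a guess at how the logic in vim/utils.py could be adapted: I am assuming the mid cls token sits at the middle of the 197-token sequence (which is what the "midclstok" name suggests), and the function below is made up for illustration rather than taken from the repo.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed_drop_mid_cls(pos_embed_ckpt, new_size):
    # pos_embed_ckpt: [1, 14*14 + 1, 192] from the classification checkpoint.
    embedding_size = pos_embed_ckpt.shape[-1]
    num_tokens = pos_embed_ckpt.shape[1]          # 197 for Vim-T midclstok
    orig_size = int((num_tokens - 1) ** 0.5)      # 14, assuming one extra token
    if num_tokens == orig_size * orig_size + 1:
        # Assumption: the "midclstok" checkpoints keep the cls token in the
        # middle of the sequence, so drop that row rather than the first one.
        mid = num_tokens // 2
        pos_embed_ckpt = torch.cat(
            [pos_embed_ckpt[:, :mid], pos_embed_ckpt[:, mid + 1:]], dim=1)
    # The reshape from the traceback now sees exactly orig_size * orig_size rows.
    pos_tokens = pos_embed_ckpt.reshape(-1, orig_size, orig_size, embedding_size)
    pos_tokens = pos_tokens.permute(0, 3, 1, 2)
    pos_tokens = F.interpolate(
        pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
    # Back to [1, new_size * new_size, embedding_size] for the segmentation model.
    return pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
```

Whether dropping that row is actually correct, or whether the segmentation backbone is supposed to keep the mid cls token, is exactly what I'm unsure about.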
Has anyone encountered this problem, or successfully fine-tuned the segmentation model?
Thank you very much!
Same issue, have you fixed it yet?
No, I can't apply the pretrained weights to the segmentation model. It seems the shapes of the backbones are different, and we might need to retrain it.
Is VisionMambaSeg a pretrained model? I loaded the pretrained Vim-small+ (26M, 81.6 / 95.4) weights from https://huggingface.co/hustvl/Vim-small-midclstok, but there are many mismatches, for example the checkpoint['meta'] entry that mmcv looks for is absent from the downloaded model.
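If the only structural complaint is the missing checkpoint['meta'] entry, maybe re-saving the downloaded file in the layout that mmcv's save_checkpoint normally produces would get past that part. A rough sketch of what I have in mind (the paths and the 'model' key are guesses, not the actual layout of the Hugging Face file):

```python
import torch

# Placeholder paths: point src at the downloaded Vim-small checkpoint.
src = 'vim_small_midclstok.pth'
dst = 'vim_small_midclstok_for_mmcv.pth'

ckpt = torch.load(src, map_location='cpu')
# Classification checkpoints often keep the weights under a 'model' key;
# fall back to the top level if that key is absent.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt

# Re-save in the {'meta': ..., 'state_dict': ...} layout that mmcv's
# checkpoint utilities produce, with an empty meta dict.
torch.save({'meta': {}, 'state_dict': state_dict}, dst)
```

This would only address the missing 'meta' entry, though, not the parameter-name or shape mismatches themselves.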