Video-LLaVA Hi, is there a bug in Video-LLaVA-main/videollava/model/multimodal

I want to fine-tune starting from the native LLaMA and LanguageBind. When the model is downloaded locally, the code takes the first "if" branch (because is_absolute_path_exists is True), but this leads to a mismatch error.

If I manually switch to the second branch instead, it reports that the image tower and the video tower have different hidden dimensions. My configuration files were all pulled from Hugging Face, so there should be no configuration errors. What causes this strange behavior?

Jan 27 '24 10:01 sunwhw

What is your "image tower"? The assertion function enforces the encoder's output dimension to be 1024. It appears that 768 is the dimension for a base version of the image encoder.
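One way to check which dimension a local image tower actually declares is to read its config.json directly. The following is a minimal sketch, not part of Video-LLaVA: the helper name `read_vision_hidden_size` and the placeholder path are hypothetical, and it assumes the checkpoint uses a CLIP-style config in which the vision settings sit under "vision_config" (or at the top level for a vision-only config).

```python
import json
import os


def read_vision_hidden_size(model_dir):
    """Return the vision hidden size recorded in <model_dir>/config.json, or None."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        cfg = json.load(f)
    # Assumption: CLIP-style checkpoints nest vision settings under
    # "vision_config"; vision-only configs keep "hidden_size" at the top level.
    vision_cfg = cfg.get("vision_config", cfg)
    return vision_cfg.get("hidden_size")


if __name__ == "__main__":
    # Hypothetical placeholder; replace with your local image tower directory.
    path = "/path/to/LanguageBind_Image"
    if os.path.isdir(path):
        print(read_vision_hidden_size(path))
```

If this prints 768 or None rather than 1024, the directory probably does not hold the checkpoint the assertion expects.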

Jan 27 '24 14:01 LinB203

I have the same problem on my local machine, but it works on https://colab.research.google.com/. The error is: RuntimeError: Error(s) in loading state_dict for CLIPVisionModel: size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

Jan 27 '24 15:01 DemiLulu

Same issue.

Jan 29 '24 08:01 jiangtaoo2333

Hi everyone, what is your "image_tower"? Is there a minimal runnable example that would help me reproduce the error?

Jan 29 '24 08:01 LinB203

config file:

"intermediate_size": 11008,
"max_position_embeddings": 4096,
"mm_hidden_size": 1024,
"mm_image_tower": "/home/demi/model_lib/LanguageBind_Image",
"mm_projector_type": "mlp2x_gelu",
"mm_use_x_patch_token": false,
"mm_use_x_start_end": false,
"mm_video_tower": "/home/demi/model_lib/LanguageBind_Video_merge",
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"model_type": "llava",
"num_attention_heads": 32,

Jan 30 '24 12:01 DemiLulu

If you want to run the model locally, you can refer to this issue: https://github.com/PKU-YuanGroup/Video-LLaVA/issues/57#issuecomment-1880367313

Jan 30 '24 15:01 LinB203

I solved it! I changed the code like this:

def build_image_tower(image_tower_cfg, **kwargs):
    image_tower = getattr(image_tower_cfg, 'mm_image_tower', getattr(image_tower_cfg, 'image_tower', None))
    # Unused after the change below; kept from the original code.
    is_absolute_path_exists = os.path.exists(image_tower)
    # Original first branch, disabled: any existing local path was sent to CLIPVisionTower.
    # if is_absolute_path_exists or image_tower.startswith("openai") or image_tower.startswith("laion"):
    #     return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
    # Only "openai"/"laion" names go to CLIPVisionTower now.
    if image_tower.startswith("openai") or image_tower.startswith("laion"):
        return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
    # A local path ending in 'LanguageBind_Image' now reaches this branch.
    if image_tower.endswith('LanguageBind_Image'):
        return LanguageBindImageTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    if 'mae' in image_tower:
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        return MAEVisionTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    raise ValueError(f'Unknown image tower: {image_tower}')

In fact, if you run locally, the code should take the second "if". I haven't changed anything else, but the "mismatch" error has disappeared. It's still weird, but anyway, it works now!
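The effect of the branch order can be sketched without loading any model. This is only an illustration under stated assumptions: `route_original` and `route_patched` are hypothetical stand-ins that return the class name as a string instead of building a tower, the 'mae' branch is left out, and the second branch of the original code is assumed to be the same `endswith` check.

```python
import os
import tempfile


def route_original(image_tower):
    """Branch order before the change: an existing local path matches first."""
    if os.path.exists(image_tower) or image_tower.startswith(("openai", "laion")):
        return "CLIPVisionTower"
    if image_tower.endswith("LanguageBind_Image"):
        return "LanguageBindImageTower"
    raise ValueError(f"Unknown image tower: {image_tower}")


def route_patched(image_tower):
    """Branch order after the change: the path-existence check is gone."""
    if image_tower.startswith(("openai", "laion")):
        return "CLIPVisionTower"
    if image_tower.endswith("LanguageBind_Image"):
        return "LanguageBindImageTower"
    raise ValueError(f"Unknown image tower: {image_tower}")


with tempfile.TemporaryDirectory() as root:
    # Simulate a locally downloaded checkpoint directory.
    local = os.path.join(root, "LanguageBind_Image")
    os.makedirs(local)
    print(route_original(local))  # CLIPVisionTower
    print(route_patched(local))   # LanguageBindImageTower
```

Because the second branch relies on `endswith('LanguageBind_Image')`, a path written with a trailing slash would skip it and hit the final ValueError, so the directory name has to be the last component of the configured path.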

Feb 05 '24 04:02 sunwhw

Great! Congrats

Feb 05 '24 05:02 LinB203

Hi, is there a bug in Video-LLaVA-main/videollava/model/multimodal_encoder/builder.py?