
Hi, is there a bug in Video-LLaVA-main/videollava/model/multimodal_encoder/builder.py?

Open sunwhw opened this issue 2 years ago • 8 comments

I want to fine-tune based on native LLaMA and LanguageBind. In principle, if the model is downloaded locally, the builder takes the first "if" branch (because is_absolute_path_exists is True), but this leads to a mismatch error.

But if I manually switch to the second branch, it reports that the image tower's and video tower's hidden dimensions are different. My configuration files were all pulled from Hugging Face, so there should be no configuration errors. What causes this strange behavior?

sunwhw avatar Jan 27 '24 10:01 sunwhw

What is your "image tower"? The assertion function enforces the encoder's output dimension to be 1024. It appears that 768 is the dimension for a base version of the image encoder.
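One way to check which dimension a locally downloaded tower declares is to read its config.json directly. The following is a minimal sketch, not code from the repository: the helper names are hypothetical, and it assumes a CLIP-style config that nests the vision settings under "vision_config" (falling back to a top-level "hidden_size" otherwise).

```python
import json
from pathlib import Path

# The value the assertion discussed in this thread enforces.
EXPECTED_HIDDEN_SIZE = 1024


def read_vision_hidden_size(model_dir):
    """Return the vision hidden size declared in <model_dir>/config.json, or None."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Assumption: CLIP-style configs nest vision settings under "vision_config";
    # vision-only configs keep "hidden_size" at the top level.
    vision = config.get("vision_config", config)
    return vision.get("hidden_size")


def check_tower(model_dir):
    """Describe whether the declared hidden size matches the expected one."""
    size = read_vision_hidden_size(model_dir)
    if size != EXPECTED_HIDDEN_SIZE:
        return f"{model_dir}: hidden_size={size}, expected {EXPECTED_HIDDEN_SIZE}"
    return f"{model_dir}: hidden_size={size} (ok)"
```

Calling check_tower on the local image tower directory and on the video tower directory shows whether the two configs agree before any weights are loaded.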

LinB203 avatar Jan 27 '24 14:01 LinB203

I have the same problem on my local computer, but it works on https://colab.research.google.com/. The error is: RuntimeError: Error(s) in loading state_dict for CLIPVisionModel: size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

DemiLulu avatar Jan 27 '24 15:01 DemiLulu

Same issue.

jiangtaoo2333 avatar Jan 29 '24 08:01 jiangtaoo2333

Hi everyone, what is your "image_tower"? Is there a minimal runnable example to help me reproduce the error?

LinB203 avatar Jan 29 '24 08:01 LinB203

config file:

"intermediate_size": 11008,
"max_position_embeddings": 4096,
"mm_hidden_size": 1024,
"mm_image_tower": "/home/demi/model_lib/LanguageBind_Image",
"mm_projector_type": "mlp2x_gelu",
"mm_use_x_patch_token": false,
"mm_use_x_start_end": false,
"mm_video_tower": "/home/demi/model_lib/LanguageBind_Video_merge",
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"model_type": "llava",
"num_attention_heads": 32,


DemiLulu avatar Jan 30 '24 12:01 DemiLulu

If you want to run the model locally, you can refer to this issue comment: https://github.com/PKU-YuanGroup/Video-LLaVA/issues/57#issuecomment-1880367313

LinB203 avatar Jan 30 '24 15:01 LinB203

I solved it! I changed the code like this:

def build_image_tower(image_tower_cfg, **kwargs):
    image_tower = getattr(image_tower_cfg, 'mm_image_tower', getattr(image_tower_cfg, 'image_tower', None))
    is_absolute_path_exists = os.path.exists(image_tower)
    # if is_absolute_path_exists or image_tower.startswith("openai") or image_tower.startswith("laion"):
    #     return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs) 
    if image_tower.startswith("openai") or image_tower.startswith("laion"):
        return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs) 
    if image_tower.endswith('LanguageBind_Image'):
        return LanguageBindImageTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    if 'mae' in image_tower:
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        print('maemaemaemaemaemaemaemae')
        return MAEVisionTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    raise ValueError(f'Unknown image tower: {image_tower}') 

In fact, if you run locally, you need to take the second "if" branch. I haven't changed anything else, but the "mismatch" error disappeared. It's still weird, but anyway, it works now!
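To make the effect of the change easier to see, here is a simplified sketch of the two dispatch orders. It is an illustration reconstructed from the snippet in this comment, not the repository's code: the function names and the path_exists parameter are hypothetical, and the functions return class names as strings instead of building towers.

```python
import os


def original_choice(image_tower, path_exists=os.path.exists):
    """Dispatch order with the commented-out condition restored."""
    # Any path that exists on disk is routed to CLIP before the name is inspected.
    if path_exists(image_tower) or image_tower.startswith(("openai", "laion")):
        return "CLIPVisionTower"
    if image_tower.endswith("LanguageBind_Image"):
        return "LanguageBindImageTower"
    if "mae" in image_tower:
        return "MAEVisionTower"
    raise ValueError(f"Unknown image tower: {image_tower}")


def patched_choice(image_tower):
    """Dispatch order after the change: only the name decides."""
    if image_tower.startswith(("openai", "laion")):
        return "CLIPVisionTower"
    if image_tower.endswith("LanguageBind_Image"):
        return "LanguageBindImageTower"
    if "mae" in image_tower:
        return "MAEVisionTower"
    raise ValueError(f"Unknown image tower: {image_tower}")


if __name__ == "__main__":
    local = "/home/demi/model_lib/LanguageBind_Image"
    # Pretend the directory exists, as it would after a local download.
    print(original_choice(local, path_exists=lambda p: True))
    print(patched_choice(local))
```

With a local directory that exists, the original order selects the CLIP tower even though the directory name ends with LanguageBind_Image, while the patched order selects the LanguageBind tower.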

sunwhw avatar Feb 05 '24 04:02 sunwhw

Great! Congrats

LinB203 avatar Feb 05 '24 05:02 LinB203