
Ablation study on using just a single-path encoder?

Open lucasjinreal opened this issue 1 year ago • 10 comments

What if you didn't use the added ConvNeXt vision encoder?

lucasjinreal avatar Mar 06 '24 06:03 lucasjinreal

The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.

luogen1996 avatar Mar 06 '24 08:03 luogen1996

I noticed that when enlarging the input size in LLaVA-1.5, you interpolate the positional embedding after computing position_ids.

This would noticeably drop performance, since the model hasn't seen large sizes during training.

What I mean is: have you run an experiment where you enlarge the input size by interpolating the position embedding weight first, and then train it along with the vision encoder (or the full model)?

What do you think the differences between these two approaches would be?

(Your interpolated embedding doesn't seem to be a trainable parameter. I didn't see a resize_position_embedding call before training here, just interpolation after position_ids is computed.)
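For concreteness, here is a minimal sketch of the two options, assuming a CLIP-style ViT with a CLS token and a learned position_embedding table. The `vision.embeddings.*` attribute path and the `bake_resized_pos_embed` helper are hypothetical illustrations, not LLaVA-HR's actual code:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bicubically resize a (1 + g*g, dim) position table to (1 + new_grid**2, dim)."""
    cls_tok, patch_pos = pos_embed[:1], pos_embed[1:]   # keep the CLS slot as-is
    old_grid = int(patch_pos.shape[0] ** 0.5)
    dim = patch_pos.shape[1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=0)

# Option A (what the issue describes): call resize_pos_embed inside forward(),
# after position_ids are computed. The resized table is recomputed from the
# frozen original weight every step, so the new positions are never trained.

# Option B (what is proposed above): resize once before training and re-register
# the result as the parameter, so it is trained directly at the new size.
def bake_resized_pos_embed(vision, new_grid: int):
    old = vision.embeddings.position_embedding.weight.data
    new_weight = resize_pos_embed(old, new_grid)
    vision.embeddings.position_embedding = torch.nn.Embedding(*new_weight.shape)
    vision.embeddings.position_embedding.weight.data.copy_(new_weight)
    # The position_ids buffer must grow to match the new table; do all of this
    # before the model is moved to GPU or sharded.
    vision.embeddings.register_buffer(
        "position_ids", torch.arange(new_weight.shape[0]).unsqueeze(0))
```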

lucasjinreal avatar Mar 06 '24 09:03 lucasjinreal

I see, maybe the approach you mentioned is better. Let me try it.

luogen1996 avatar Mar 06 '24 09:03 luogen1996

Nice, let me know the differences between them once you've tried it.

lucasjinreal avatar Mar 06 '24 09:03 lucasjinreal

@luogen1996 Hello, I am running SFT stage 2 following your code, fine-tuning with ZeRO-3, and I got some warnings:

- vision_model.head.layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc1.bias: found shape torch.Size([4304]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc1.weight: found shape torch.Size([4304, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc2.weight: found shape torch.Size([1152, 4304]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.probe: found shape torch.Size([1, 1, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.post_layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.post_layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated

There are some reports (https://github.com/microsoft/DeepSpeed/issues/3574) indicating this is related to enabling gradient_checkpointing and ZeRO-3 at the same time.

Does this affect model training? The loss looks normal.
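For reference, the torch.Size([0]) side of those messages is what ZeRO-3 partitioning looks like from the outside, not a corrupted load: every parameter is sharded at init, so its local shape is empty until gathered. A small sketch to confirm, where `model` is a placeholder for the ZeRO-3-initialized model:

```python
import deepspeed

# Collect the parameters named in the warnings (the name prefix may differ
# depending on how the model is wrapped).
flagged = [p for n, p in model.named_parameters()
           if "vision_model.head" in n or "post_layernorm" in n]

# GatheredParameters temporarily reassembles the shards on every rank.
with deepspeed.zero.GatheredParameters(flagged, modifier_rank=None):
    for n, p in model.named_parameters():
        if "vision_model.head" in n or "post_layernorm" in n:
            print(n, tuple(p.shape))  # full shapes, e.g. (1152,), not (0,)
```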

lucasjinreal avatar Mar 07 '24 08:03 lucasjinreal

I don't see these warnings in my logs. The weights you printed are not used in the model, so you can probably just ignore them.
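One way to verify that empirically, as a sketch (with `model` and `batch` as placeholders for the loaded model and any prepared input): hook the suspect modules and check that they never run during a normal forward pass.

```python
import torch

# If none of these hooks fire, the flagged weights sit outside the compute
# path and the load-time warnings are harmless.
fired = set()
for name, module in model.named_modules():
    if "vision_model.head" in name or name.endswith("post_layernorm"):
        module.register_forward_hook(lambda m, i, o, n=name: fired.add(n))

with torch.no_grad():
    model(**batch)

print("modules that ran:", fired if fired else "none - safe to ignore")
```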

luogen1996 avatar Mar 07 '24 09:03 luogen1996

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.

In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

BlueBlueFF avatar Mar 11 '24 02:03 BlueBlueFF

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.
>
> In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

@luogen1996 Thanks~

BlueBlueFF avatar Mar 11 '24 02:03 BlueBlueFF

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.
>
> In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

Yes, we trained LLaVA-1.5 at different resolutions. The training settings are the same as for LLaVA-HR, i.e. low-resolution pre-training followed by high-resolution instruction tuning.

luogen1996 avatar Mar 11 '24 02:03 luogen1996

> @luogen1996 Hello, I am running SFT stage 2 following your code, fine-tuning with ZeRO-3, and I got some warnings:
>
> - vision_model.head.layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc1.bias: found shape torch.Size([4304]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc1.weight: found shape torch.Size([4304, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc2.weight: found shape torch.Size([1152, 4304]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.probe: found shape torch.Size([1, 1, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.post_layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.post_layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
>
> There are some reports (microsoft/DeepSpeed#3574) indicating this is related to enabling gradient_checkpointing and ZeRO-3 at the same time.
>
> Does this affect model training? The loss looks normal.

Hello, I also met this problem. Could you please share how you resolved it?

CserDu avatar Aug 06 '24 13:08 CserDu