
Ablation study on using just a single-path encoder?

Open lucasjinreal opened this issue 1 year ago • 10 comments

What if you didn't use the added ConvNeXt vision encoder?

lucasjinreal avatar Mar 06 '24 06:03 lucasjinreal

The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.

luogen1996 avatar Mar 06 '24 08:03 luogen1996

I noticed that when enlarging the input size in LLaVA-1.5, you interpolate the positional embedding after computing position_ids.

This would noticeably drop performance, since the model hasn't seen large sizes during training.

What I mean is: have you run an experiment where you enlarge the input size by interpolating the position embedding weight first, and then train it along with the vision encoder (or the full model)?

What do you think the differences between these two approaches would be?

(Your interpolated embedding doesn't seem to be a trainable parameter. I didn't see a resize_position_embedding call before training here, just interpolation after position_ids is computed.)
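For concreteness, here is a minimal sketch of the two options, assuming a CLIP-style ViT with a CLS token and a learned position_embedding table. The `vision.embeddings.*` attribute path and the `bake_resized_pos_embed` helper are hypothetical illustrations, not LLaVA-HR's actual code:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bicubically resize a (1 + g*g, dim) position table to (1 + new_grid**2, dim)."""
    cls_tok, patch_pos = pos_embed[:1], pos_embed[1:]   # keep the CLS slot as-is
    old_grid = int(patch_pos.shape[0] ** 0.5)
    dim = patch_pos.shape[1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=0)

# Option A (what the issue describes): call resize_pos_embed inside forward(),
# after position_ids are computed. The resized table is recomputed from the
# frozen original weight every step, so the new positions are never trained.

# Option B (what is proposed above): resize once before training and re-register
# the result as the parameter, so it is trained directly at the new size.
def bake_resized_pos_embed(vision, new_grid: int):
    old = vision.embeddings.position_embedding.weight.data
    new_weight = resize_pos_embed(old, new_grid)
    vision.embeddings.position_embedding = torch.nn.Embedding(*new_weight.shape)
    vision.embeddings.position_embedding.weight.data.copy_(new_weight)
    # The position_ids buffer must grow to match the new table; do all of this
    # before the model is moved to GPU or sharded.
    vision.embeddings.register_buffer(
        "position_ids", torch.arange(new_weight.shape[0]).unsqueeze(0))
```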

lucasjinreal avatar Mar 06 '24 09:03 lucasjinreal

I see, maybe the approach you mentioned is better. Let me try it.

luogen1996 avatar Mar 06 '24 09:03 luogen1996

Nice, let me know the differences between them once you've tried it.

lucasjinreal avatar Mar 06 '24 09:03 lucasjinreal

@luogen1996 Hello, I am running SFT stage 2 following your code, fine-tuning with ZeRO-3, and I got some warnings:

- vision_model.head.layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc1.bias: found shape torch.Size([4304]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc1.weight: found shape torch.Size([4304, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.mlp.fc2.weight: found shape torch.Size([1152, 4304]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.head.probe: found shape torch.Size([1, 1, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.post_layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
- vision_model.post_layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated

There are some reports (https://github.com/microsoft/DeepSpeed/issues/3574) indicating this is related to enabling gradient_checkpointing and ZeRO-3 at the same time.

Does this affect model training? The loss looks normal.
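For reference, the torch.Size([0]) side of those messages is what ZeRO-3 partitioning looks like from the outside, not a corrupted load: every parameter is sharded at init, so its local shape is empty until gathered. A small sketch to confirm, where `model` is a placeholder for the ZeRO-3-initialized model:

```python
import deepspeed

# Collect the parameters named in the warnings (the name prefix may differ
# depending on how the model is wrapped).
flagged = [p for n, p in model.named_parameters()
           if "vision_model.head" in n or "post_layernorm" in n]

# GatheredParameters temporarily reassembles the shards on every rank.
with deepspeed.zero.GatheredParameters(flagged, modifier_rank=None):
    for n, p in model.named_parameters():
        if "vision_model.head" in n or "post_layernorm" in n:
            print(n, tuple(p.shape))  # full shapes, e.g. (1152,), not (0,)
```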

lucasjinreal avatar Mar 07 '24 08:03 lucasjinreal

I don't see these warnings in my logs. The weights you printed are not used in the model, so you can probably just ignore them.
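One way to verify that empirically, as a sketch (with `model` and `batch` as placeholders for the loaded model and any prepared input): hook the suspect modules and check that they never run during a normal forward pass.

```python
import torch

# If none of these hooks fire, the flagged weights sit outside the compute
# path and the load-time warnings are harmless.
fired = set()
for name, module in model.named_modules():
    if "vision_model.head" in name or name.endswith("post_layernorm"):
        module.register_forward_hook(lambda m, i, o, n=name: fired.add(n))

with torch.no_grad():
    model(**batch)

print("modules that ran:", fired if fired else "none - safe to ignore")
```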

luogen1996 avatar Mar 07 '24 09:03 luogen1996

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.

In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

BlueBlueFF avatar Mar 11 '24 02:03 BlueBlueFF

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.
>
> In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

@luogen1996 Thanks~

BlueBlueFF avatar Mar 11 '24 02:03 BlueBlueFF

> The model you mentioned with a single visual path is exactly LLaVA-1.5. We have conducted extensive comparisons in Table 1 of our paper; please check it out.
>
> In Table 1, is LLaVA-1.5 trained at different resolutions, or only evaluated at different resolutions?

Yes, we trained LLaVA-1.5 at different resolutions. The training settings are the same as for LLaVA-HR, i.e. low-resolution pre-training followed by high-resolution instruction tuning.

luogen1996 avatar Mar 11 '24 02:03 luogen1996

> @luogen1996 Hello, I am running SFT stage 2 following your code, fine-tuning with ZeRO-3, and I got some warnings:
>
> - vision_model.head.layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc1.bias: found shape torch.Size([4304]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc1.weight: found shape torch.Size([4304, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.mlp.fc2.weight: found shape torch.Size([1152, 4304]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.head.probe: found shape torch.Size([1, 1, 1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.post_layernorm.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
> - vision_model.post_layernorm.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([0]) in the model instantiated
>
> There are some reports (microsoft/DeepSpeed#3574) indicating this is related to enabling gradient_checkpointing and ZeRO-3 at the same time.
>
> Does this affect model training? The loss looks normal.

Hello, I also met this problem. Could you please share how you resolved it?

CserDu avatar Aug 06 '24 13:08 CserDu