MoE-LLaVA
RuntimeError: mat1 and mat2 must have the same dtype
Logs from running inference with a custom-trained model (LanguageBind image tower + Qwen-14B LLM):
(moellava) root@ps:/code/MoE-LLaVA# CUDA_VISIBLE_DEVICES=0 python predict.py
[2024-03-18 02:02:14,276] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
warnings.warn(
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
projector_type: mlp2x_gelu
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00, 4.07s/it]
Some weights of the model checkpoint at /output/llava-qwen14/checkpoint-200/ were not used when initializing LlavaQWenForCausalLM: ['transformer.image_tower.image_tower.embeddings.class_embedding', 'transformer.image_tower.image_tower.embeddings.patch_embedding.weight', 'transformer.image_tower.image_tower.embeddings.position_embedding.weight', ..., 'transformer.image_tower.image_tower.post_layernorm.bias', 'transformer.image_tower.image_tower.post_layernorm.weight', 'transformer.image_tower.image_tower.pre_layrnorm.bias', 'transformer.image_tower.image_tower.pre_layrnorm.weight'] (list truncated for readability; every unused weight belongs to transformer.image_tower.*)
- This IS expected if you are initializing LlavaQWenForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlavaQWenForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
LlavaQWenForCausalLM(
  (transformer): LlavaQWenModel(
    (wte): Embedding(152064, 5120)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-39): 40 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=5120, out_features=15360, bias=True)
          (c_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=5120, out_features=13696, bias=False)
          (w2): Linear(in_features=5120, out_features=13696, bias=False)
          (c_proj): Linear(in_features=13696, out_features=5120, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
    (image_tower): LanguageBindImageTower()
    (mm_projector): build_projector(
      (image_spatial_proj): Sequential(
        (0): Linear(in_features=1024, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=5120, bias=True)
      )
      (video_patch_proj): Identity()
      (video_spatial_proj): Identity()
      (video_temproal_proj): Identity()
      (video_global_proj): Identity()
    )
  )
  (lm_head): Linear(in_features=5120, out_features=152064, bias=False)
)
/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
/opt/conda/envs/moellava/lib/python3.10/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
warnings.warn(
ASSISTANT: What is the man in the picture doing?
2024-03-18 02:02:45.831 | WARNING | __main__:main:38 - ==================
tensor([[ 32, 6236, 1948, 264, 22208, 1196, 323, 458, 20443,
11229, 17847, 13, 576, 17847, 6696, 10950, 11, 11682,
11, 323, 47787, 11253, 311, 279, 1196, 594, 4755,
13, 13872, 25, 220, -200, 198, 45930, 101047, 102015,
18493, 106428, 30, 35560, 3846, 2821, 25]],
device='cuda:0')
tensor([[[[-1.7891, -1.7891, -1.7891, ..., -1.7930, -1.7930, -1.7891],
[-1.7627, -1.7676, -1.7627, ..., -1.7686, -1.7559, -1.7637],
[-1.7461, -1.7480, -1.7471, ..., -1.7490, -1.7363, -1.7520],
...,
[-1.7285, -1.7344, -1.6748, ..., -1.7461, -1.7266, -1.7402],
[-1.7686, -1.7510, -1.7715, ..., -1.7949, -1.7402, -1.7734],
[-1.7832, -1.7910, -1.7852, ..., -1.7891, -1.7900, -1.7930]],
[[-1.7539, -1.7500, -1.7422, ..., -1.7412, -1.7432, -1.7539],
[-1.7002, -1.7051, -1.7021, ..., -1.7422, -1.7246, -1.7119],
[-1.6758, -1.6797, -1.6826, ..., -1.6777, -1.6650, -1.6807],
...,
[-1.6445, -1.6914, -1.6289, ..., -1.7041, -1.6758, -1.6982],
[-1.7168, -1.7119, -1.7451, ..., -1.7383, -1.7158, -1.7324],
[-1.7432, -1.7471, -1.7451, ..., -1.7490, -1.7529, -1.7510]],
[[-1.4814, -1.4814, -1.4814, ..., -1.4834, -1.4834, -1.4834],
[-1.4434, -1.4473, -1.4463, ..., -1.4082, -1.3926, -1.3936],
[-1.3838, -1.3867, -1.3867, ..., -1.3975, -1.3994, -1.3994],
...,
[-1.4268, -1.4434, -1.3691, ..., -1.4326, -1.4082, -1.4443],
[-1.4775, -1.4551, -1.4697, ..., -1.4756, -1.4756, -1.4590],
[-1.4629, -1.4678, -1.4668, ..., -1.4697, -1.4736, -1.4805]]]],
device='cuda:0', dtype=torch.float16)
2024-03-18 02:02:45.843 | WARNING | __main__:main:41 - ==================
++++++++++++++++
tensor([[[-0.9653, 0.5757, -1.1807, ..., 1.2803, -0.7188, -0.8818],
[ 0.4961, 2.7051, 0.0115, ..., 0.6382, -0.6060, -0.5703],
[ 0.1807, 0.8447, 0.4824, ..., 1.0771, -0.0136, -1.2354],
...,
[ 0.6416, 1.0879, -0.5303, ..., 1.1309, -0.9102, -0.0253],
[ 0.3801, 3.1152, -0.9663, ..., -0.0643, -0.4917, 1.3672],
[-0.8354, 0.7363, -1.6709, ..., 1.4736, -0.3210, -0.8779]]],
device='cuda:0', dtype=torch.float16)
image_feature_shape: torch.Size([1, 256, 1024])
Traceback (most recent call last):
  File "/code/MoE-LLaVA/predict.py", line 57, in <module>
    main()
  File "/code/MoE-LLaVA/predict.py", line 44, in main
    output_ids = model.generate(
  File "/code/MoE-LLaVA/moellava/model/language_model/qwen/modeling_qwen.py", line 1260, in generate
    return super().generate(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1520, in generate
    return self.sample(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2617, in sample
    outputs = self(
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/code/MoE-LLaVA/moellava/model/language_model/llava_qwen.py", line 147, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/code/MoE-LLaVA/moellava/model/llava_arch.py", line 458, in prepare_inputs_labels_for_multimodal
    image_features_minibatch = self.encode_images(images_minibatch) # [mini_b, l, c]
  File "/code/MoE-LLaVA/moellava/model/llava_arch.py", line 155, in encode_images
    image_features = self.get_model().mm_projector.forward_image(image_features)
  File "/code/MoE-LLaVA/moellava/model/multimodal_projector/builder.py", line 140, in forward_image
    return self.image_spatial_proj(image_feature)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/moellava/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
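My reading of the log: the loader reports "The model is automatically converting to bf16 for faster inference", while predict.py (below) casts image_tensor to torch.float16, so the mm_projector's Linear apparently receives an fp16 input against bf16 weights and F.linear fails. A minimal illustrative sketch of that mismatch (not MoE-LLaVA code; layer sizes only mirror image_spatial_proj), including the cast that removes it:

import torch

# Illustrative sketch, assuming the projector weights ended up in bf16 while
# the image features were cast to fp16, as the log above suggests.
proj = torch.nn.Linear(1024, 5120).to(torch.bfloat16)   # bf16 weights, like the auto-converted checkpoint
feats = torch.randn(1, 256, 1024, dtype=torch.float16)  # fp16 features, like image_tensor in predict.py

try:
    proj(feats)                      # mixed dtypes -> RuntimeError (on CUDA: "mat1 and mat2 must have the same dtype")
except RuntimeError as err:
    print(err)

# Casting the input to the module's dtype removes the mismatch.
out = proj(feats.to(proj.weight.dtype))
print(out.dtype)                     # torch.bfloat16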
pretrain.sh
JSON_FOLDER="/data/llava_pt/json"
IMAGE_FOLDER="/data"
# cd ~/MoE-LLaVA
# HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1
CUDA_VISIBLE_DEVICES=0,1,2,3 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--model_name_or_path /model/Qwen-14B \
--version plain \
--data_path ${JSON_FOLDER}/llava_image_.json \
--image_folder ${IMAGE_FOLDER} \
--image_tower /model/LanguageBind/LanguageBind_Image \
--image_projector_type mlp2x_gelu \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir /output/llavaqwen-14b-pretrain \
--num_train_epochs 1.5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 1e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--cache_dir "./cache_dir"
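The projector produced by this stage is what finetune.sh (at the end of this report) consumes via --pretrain_mm_mlp_adapter, so checking which dtype it was saved in may help narrow down the mismatch. A hypothetical diagnostic, assuming mm_projector.bin is a flat state dict of tensors (path taken from --output_dir above):

import torch

# Hypothetical diagnostic: report the dtype of every tensor in the projector
# saved by pretrain.sh.
state = torch.load('/output/llavaqwen-14b-pretrain/mm_projector.bin', map_location='cpu')
for name, tensor in state.items():
    print(name, tuple(tensor.shape), tensor.dtype)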
predict.py
import torch
from PIL import Image
from moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from moellava.conversation import conv_templates, SeparatorStyle
from moellava.model.builder import load_pretrained_model
from moellava.utils import disable_torch_init
from moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from loguru import logger as log
def main():
    disable_torch_init()
    # image = 'moellava/serve/examples/extreme_ironing.jpg'
    # inp = 'What is unusual about this image?'
    image = '/data/lrv_tune/images/2371990.jpg'
    inp = 'What is the man in the picture doing?'
    model_path = '/output/llava-qwen14/checkpoint-200/'  # choose a model
    device = 'cuda'
    load_4bit, load_8bit = False, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "qwen"  # phi or qwen or stablelm
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles
    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    log.warning("==================")
    print(input_ids)
    print(image_tensor)
    log.warning("==================")
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    print(outputs)

if __name__ == '__main__':
    main()
'''
deepspeed predict.py
'''
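Given the "automatically converting to bf16" message, the hard-coded dtype=torch.float16 cast in main() looks like a plausible source of the mismatch. One hedged adjustment I have not yet verified is to cast the image tensor to whatever precision the model actually loaded in:

# Unverified sketch: replace the hard-coded torch.float16 cast in main();
# model.dtype reflects the precision the loader actually chose (bf16 here).
image_tensor = image_processor.preprocess(
    Image.open(image).convert('RGB'), return_tensors='pt'
)['pixel_values'].to(model.device, dtype=model.dtype)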
finetune.sh
#!/bin/bash
JSON_FOLDER="/data/lrv_tune/json"
IMAGE_FOLDER="/data"
cd /code/MoE-LLaVA
CUDA_VISIBLE_DEVICES=0,1,2,3 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \
--deepspeed ./scripts/zero2_offload.json \
--model_name_or_path /model/Qwen-14B \
--version qwen \
--data_path ${JSON_FOLDER}/chinese_lrv_tune_50k.json \
--image_folder ${IMAGE_FOLDER} \
--image_tower /model/LanguageBind/LanguageBind_Image \
--image_projector_type mlp2x_gelu \
--pretrain_mm_mlp_adapter /output/llavaqwen-14b-pretrain/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir /output/llava-qwen14 \
--num_train_epochs 2.3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to tensorboard \
--cache_dir "./cache_dir"
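If the projector weights inside checkpoint-200 were themselves saved in a different precision than the rest of the model, that could also explain the error. A hypothetical audit of the saved shards is sketched below; the pytorch_model-*.bin naming pattern is an assumption and may need to be adapted to the actual checkpoint layout.

import glob
import torch

# Hypothetical audit: print the dtype of every mm_projector parameter stored
# in the fine-tuned checkpoint shards (shard file pattern is an assumption).
for shard in sorted(glob.glob('/output/llava-qwen14/checkpoint-200/pytorch_model-*.bin')):
    state = torch.load(shard, map_location='cpu')
    for name, tensor in state.items():
        if 'mm_projector' in name:
            print(shard.split('/')[-1], name, tensor.dtype)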