🎉 Support for fine-tuning of the Qwen2-VL-Chat series models
🎉 Fine-tuning (VQA/OCR/Grounding/Video) of the Qwen2-VL-Chat series models is now supported; please see the documentation below for details:
English
https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/qwen2-vl-best-practice.md
Chinese
https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md
Qwen2-VL needs transformers>=4.45.0.dev0, but swift requires transformers<4.45.0. How can this be fixed?
After installing swift, run: pip install git+https://github.com/huggingface/transformers.git
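That is, roughly this sequence (a sketch; the ms-swift[llm] install extra is an assumption based on the project's install instructions of that period, so double-check it against the README for your version):
# Install ms-swift first, then override its pinned transformers with the development build from source.
pip install 'ms-swift[llm]' -U
pip install git+https://github.com/huggingface/transformers.git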
Qwen2-VL seems to be incompatible with FlashAttention. When I add "--use_flash_attn True", I encounter this error (CUDA_LAUNCH_BLOCKING was enabled to print the exact trace):
[rank0]: File "***/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 182, in apply_multimodal_rotary_pos_emb
[rank0]: cos = cos[position_ids]
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
EDIT: There seems to be a problem with the multimodal rotary position embedding introduced by Qwen2-VL. Even with FlashAttention turned off, I still encounter this error.
I do think this may be a bug in the Qwen2-VL code; I tried to fix it by editing modeling_qwen2_vl.py:
This works
It did work, thanks.
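For anyone else hitting the same device-side assert: the patch referred to above is not reproduced here, but since the failure is an indexing error on cos[position_ids], a minimal diagnostic sketch is to check the index range right before that line (run with CUDA_LAUNCH_BLOCKING=1; the assumption is that position_ids is an integer tensor indexing the first dimension of the rotary table):
import torch

def check_mrope_position_ids(position_ids: torch.Tensor, cos: torch.Tensor) -> None:
    # A device-side assert raised by `cos[position_ids]` usually means an index
    # is out of range, so compare the largest multimodal position id against
    # the length of the rotary table (assumed to be its first dimension).
    max_id = int(position_ids.max())
    table_len = cos.shape[0]
    if max_id >= table_len:
        raise ValueError(
            f"position_ids reach {max_id}, but the rotary table only has "
            f"{table_len} entries; the multimodal rope position ids look wrong."
        )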
An error occurs when I fine-tune with LoRA on a V100:
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Could you provide the complete modeling_qwen2_vl.py file? I encountered an error while fine-tuning qwen2-vl-2b-instruct.
File "./miniconda3/envs/swift/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 296, in forward
[rank1]: attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
AttributeError: 'VisionAttention' object has no attribute 'head_dim'
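Not the complete file, but a hedged workaround sketch: the error means the installed snapshot's VisionAttention never sets head_dim even though its forward uses it. Assuming the class keeps num_heads and a qkv linear layer of input size dim (true in the transformers versions I have checked, but verify against your copy), you can either add self.head_dim = dim // num_heads inside VisionAttention.__init__ in modeling_qwen2_vl.py, or patch the loaded model without editing the file:
import torch.nn as nn

def add_missing_head_dim(model: nn.Module) -> None:
    # Fill in VisionAttention.head_dim when the installed transformers snapshot
    # does not set it. `qkv` (a Linear mapping dim -> 3 * dim) and `num_heads`
    # are assumptions about that class; adjust the names if yours differ.
    for module in model.modules():
        if module.__class__.__name__ == "VisionAttention" and not hasattr(module, "head_dim"):
            module.head_dim = module.qkv.in_features // module.num_heads
Upgrading transformers to a newer commit (see the pip install above) may also make this workaround unnecessary.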
Added an example of single-card A10 fine-tuning: https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/qwen2-vl-best-practice.md#image-ocr-fine-tuning
When I fine-tune on an A40, GPU memory keeps growing until I hit CUDA out of memory. This is the command I used:
CUDA_VISIBLE_DEVICES=3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset dataset.json
https://github.com/modelscope/ms-swift/issues/1860
You can save memory by reducing the image resolution: set SIZE_FACTOR=8 and MAX_PIXELS=602112.
See here: https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html#ocr
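For example, a sketch of the same LoRA command with those settings applied as environment variables (values taken from the reply above; see the linked best practice for exactly what they control):
# Cap the image resolution seen by the model to keep activation memory bounded.
SIZE_FACTOR=8 \
MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset dataset.json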
Full parameter fine-tuning & freeze_vit support reference:
https://github.com/modelscope/ms-swift/issues/1879
I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)?
My understanding is that --freeze_vit true will not train the vision-related parameters; is that correct? cc @Jintao-Huang @tastelikefeet
--freeze_vit true will not train the parameters of the ViT (when sft_type is full).
@Jintao-Huang thanks for explaining. Could you please confirm whether this is correct?
I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)? (if using LoRA)
By default, LoRA does not train the ViT. In your case, there is no need to enable ViT training.
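So for the text-only dialogue case, a sketch along these lines should be enough (dataset path is a placeholder; with LoRA the ViT is untouched by default, and --freeze_vit true only matters when --sft_type full):
CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset my_text_only_dialogue.json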
Fine-tuning best practices for qwen2.5-72b-instruct and qwen2-vl-72b-instruct:
https://github.com/modelscope/ms-swift/issues/2064
many thanks @Jintao-Huang
For swift 3, please refer to: https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal