🎉 Support for fine-tuning of the Qwen2-VL-Chat series models
🎉 Fine-tuning (VQA/OCR/Grounding/Video) of the Qwen2-VL-Chat series models is now supported; please see the documentation below for details:
English
https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/qwen2-vl-best-practice.md
Chinese
https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md
Qwen2-VL needs transformers>=4.45.0.dev0, but swift requires transformers<4.45.0. How can this be fixed?
After installing swift, run: pip install git+https://github.com/huggingface/transformers.git
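That is, roughly this sequence (a sketch; the ms-swift[llm] install extra is an assumption based on the project's install instructions of that period, so double-check it against the README for your version):
# Install ms-swift first, then override its pinned transformers with the development build from source.
pip install 'ms-swift[llm]' -U
pip install git+https://github.com/huggingface/transformers.git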
Qwen2-VL seems to be incompatible with FlashAttention. When I add "--use_flash_attn True", I encounter this error (CUDA_LAUNCH_BLOCKING was enabled to print the exact trace):
[rank0]: File "***/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 182, in apply_multimodal_rotary_pos_emb
[rank0]: cos = cos[position_ids]
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
EDIT: There seems to be a problem with the multimodal rotary position embedding introduced by Qwen2-VL. Even with FlashAttention turned off, I still encounter this error.
I do think this may be a bug in the Qwen2-VL code; I tried to fix it by editing modeling_qwen2_vl.py:
This works
It did work, thanks.
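For anyone else hitting the same device-side assert: the patch referred to above is not reproduced here, but since the failure is an indexing error on cos[position_ids], a minimal diagnostic sketch is to check the index range right before that line (run with CUDA_LAUNCH_BLOCKING=1; the assumption is that position_ids is an integer tensor indexing the first dimension of the rotary table):
import torch

def check_mrope_position_ids(position_ids: torch.Tensor, cos: torch.Tensor) -> None:
    # A device-side assert raised by `cos[position_ids]` usually means an index
    # is out of range, so compare the largest multimodal position id against
    # the length of the rotary table (assumed to be its first dimension).
    max_id = int(position_ids.max())
    table_len = cos.shape[0]
    if max_id >= table_len:
        raise ValueError(
            f"position_ids reach {max_id}, but the rotary table only has "
            f"{table_len} entries; the multimodal rope position ids look wrong."
        )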
An error occurs when I fine-tune with LoRA on a V100:
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Could you provide the complete modeling_qwen2_vl.py file? I encountered an error while fine-tuning qwen2-vl-2b-instruct.
File "./miniconda3/envs/swift/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 296, in forward
[rank1]: attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
AttributeError: 'VisionAttention' object has no attribute 'head_dim'
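Not the complete file, but a hedged workaround sketch: the error means the installed snapshot's VisionAttention never sets head_dim even though its forward uses it. Assuming the class keeps num_heads and a qkv linear layer of input size dim (true in the transformers versions I have checked, but verify against your copy), you can either add self.head_dim = dim // num_heads inside VisionAttention.__init__ in modeling_qwen2_vl.py, or patch the loaded model without editing the file:
import torch.nn as nn

def add_missing_head_dim(model: nn.Module) -> None:
    # Fill in VisionAttention.head_dim when the installed transformers snapshot
    # does not set it. `qkv` (a Linear mapping dim -> 3 * dim) and `num_heads`
    # are assumptions about that class; adjust the names if yours differ.
    for module in model.modules():
        if module.__class__.__name__ == "VisionAttention" and not hasattr(module, "head_dim"):
            module.head_dim = module.qkv.in_features // module.num_heads
Upgrading transformers to a newer commit (see the pip install above) may also make this workaround unnecessary.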
Added an example of single-card A10 fine-tuning: https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/qwen2-vl-best-practice.md#image-ocr-fine-tuning
When I fine-tune on an A40, GPU memory keeps growing until I hit CUDA out of memory. This is the command I used:
CUDA_VISIBLE_DEVICES=3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset dataset.json
https://github.com/modelscope/ms-swift/issues/1860
You can save memory by reducing the image resolution: set SIZE_FACTOR=8 and MAX_PIXELS=602112.
See here: https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html#ocr
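For example, a sketch of the same LoRA command with those settings applied as environment variables (values taken from the reply above; see the linked best practice for exactly what they control):
# Cap the image resolution seen by the model to keep activation memory bounded.
SIZE_FACTOR=8 \
MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset dataset.json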
Full parameter fine-tuning & freeze_vit support reference:
https://github.com/modelscope/ms-swift/issues/1879
I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)?
My understanding is that --freeze_vit true will not train the vision-related parameters; is that correct? cc @Jintao-Huang @tastelikefeet
--freeze_vit true will not train the parameters of the ViT (when sft_type is full).
@Jintao-Huang thanks for explaining. Could you please confirm whether this is correct?
I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)? (if using LoRA)
By default, LoRA does not train the ViT. In your case, there is no need to enable ViT training.
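So for the text-only dialogue case, a sketch along these lines should be enough (dataset path is a placeholder; with LoRA the ViT is untouched by default, and --freeze_vit true only matters when --sft_type full):
CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset my_text_only_dialogue.json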
Fine-tuning best practices for qwen2.5-72b-instruct and qwen2-vl-72b-instruct:
https://github.com/modelscope/ms-swift/issues/2064
many thanks @Jintao-Huang
For swift 3, please refer to: https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal