
🎉 Support for fine-tuning of Qwen2-VL-Chat series models

Open tastelikefeet opened this issue 1 year ago • 17 comments

🎉 Fine-tuning (VQA/OCR/Grounding/Video) of the Qwen2-VL-Chat series models is now supported. Please check the documentation below for details:

English

https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/qwen2-vl-best-practice.md

Chinese

https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

tastelikefeet avatar Aug 29 '24 16:08 tastelikefeet

qwen2-vl needs transformers>=4.45.0.dev0, but swift requires transformers<4.45.0. How can this be fixed?

Halflifefa avatar Aug 30 '24 01:08 Halflifefa

> qwen2-vl needs transformers>=4.45.0.dev0, but swift requires transformers<4.45.0. How can this be fixed?

After installing swift, run `pip install git+https://github.com/huggingface/transformers.git`.
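A minimal sketch of the full install sequence (assuming the PyPI package name `ms-swift` used by this repo):

```shell
# Install swift first; its pinned transformers<4.45.0 will be replaced in the next step
pip install ms-swift -U

# Then override transformers with the development build that qwen2-vl requires
pip install git+https://github.com/huggingface/transformers.git
```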

tastelikefeet avatar Aug 30 '24 02:08 tastelikefeet

Qwen2-VL seems not to be compatible with FlashAttention. When I add "--use_flash_attn True", I encountered this error (CUDA_LAUNCH_BLOCKING was enabled to print the exact trace):

```
[rank0]: File "***/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 182, in apply_multimodal_rotary_pos_emb
[rank0]: cos = cos[position_ids]
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

EDIT: There seems to be a problem with the multimodal rotary position embedding introduced by Qwen2-VL. Even with Flash Attention turned off, I still encounter this error.
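For reference, the blocking launch mode mentioned above is enabled by setting an environment variable on the training command (a sketch using the flags that appear elsewhere in this thread):

```shell
# Synchronous kernel launches make the CUDA traceback point at the real failing op
CUDA_LAUNCH_BLOCKING=1 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --use_flash_attn true \
  --dataset dataset.json
```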

VietDunghacker avatar Aug 30 '24 02:08 VietDunghacker

> Qwen2-VL seems not to be compatible with FlashAttention. When I add "--use_flash_attn True", I encountered this error (CUDA_LAUNCH_BLOCKING was enabled to print the exact trace):
>
> [rank0]: File "***/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 182, in apply_multimodal_rotary_pos_emb
> [rank0]: cos = cos[position_ids]
> [rank0]: RuntimeError: CUDA error: device-side assert triggered
> [rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>
> EDIT: There seems to be a problem with the multimodal rotary position embedding introduced by Qwen2-VL. Even with Flash Attention turned off, I still encounter this error.

I do think this may be a bug in the qwen2 code. I tried to fix it in modeling_qwen2_vl.py (patch shown in the attached screenshot). This works.

tastelikefeet avatar Aug 30 '24 03:08 tastelikefeet

It did work, thanks.

VietDunghacker avatar Aug 30 '24 03:08 VietDunghacker

An error occurs when I fine-tune with LoRA on a V100:

```
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

wade30822 avatar Aug 30 '24 08:08 wade30822

Could you provide the complete modeling_qwen2_vl.py file? I encountered an error while fine-tuning qwen2-vl-2b-instruct.
File "./miniconda3/envs/swift/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 296, in forward [rank1]: attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim) AttributeError: 'VisionAttention' object has no attribute 'head_dim'

Betty-J avatar Aug 30 '24 08:08 Betty-J

Added an example of single-card A10 fine-tuning: https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/qwen2-vl-best-practice.md#image-ocr-fine-tuning

Jintao-Huang avatar Aug 30 '24 08:08 Jintao-Huang

When I fine-tune on an A40, GPU memory grows without bound and eventually causes CUDA out of memory. This is the command I used:

```shell
CUDA_VISIBLE_DEVICES=3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset dataset.json
```

KirbytroNic0528 avatar Aug 30 '24 08:08 KirbytroNic0528

https://github.com/modelscope/ms-swift/issues/1860

You can save memory by reducing the image resolution with SIZE_FACTOR=8 and MAX_PIXELS=602112.

See here: https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html#ocr
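For example, combining these settings with the command shared above (a sketch; SIZE_FACTOR and MAX_PIXELS are read as environment variables):

```shell
# Lower the resolution of the input images to cut activation memory
SIZE_FACTOR=8 MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset dataset.json
```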

Jintao-Huang avatar Aug 31 '24 08:08 Jintao-Huang

For full-parameter fine-tuning and freeze_vit support, see:

https://github.com/modelscope/ms-swift/issues/1879
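A minimal sketch of what such a run might look like, combining the flags discussed in this thread (see the linked issue for the authoritative example):

```shell
# Full-parameter fine-tuning of the LLM while keeping the vision tower frozen
CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --sft_type full \
  --freeze_vit true \
  --dataset dataset.json
```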

Jintao-Huang avatar Aug 31 '24 13:08 Jintao-Huang

I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)? I understand that --freeze_vit true means the vision-related parameters will not be trained; is that correct? cc @Jintao-Huang @tastelikefeet

lh0x00 avatar Sep 17 '24 15:09 lh0x00

--freeze_vit true will not train the parameters of the ViT (when sft_type is full).

Jintao-Huang avatar Sep 18 '24 13:09 Jintao-Huang

@Jintao-Huang thanks for explaining. Could you please confirm that this is correct?

I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)? (if using LoRA)

lh0x00 avatar Sep 18 '24 14:09 lh0x00

> @Jintao-Huang thanks for explaining. Could you please confirm that this is correct?
>
> I want to train it with a text-only dataset to improve its ability to answer dialogue questions. Do I need to set --freeze_vit true, with the training dataset being text-only dialogue (no images)? (if using LoRA)

By default, LoRA does not train the ViT. In your case, there is no need to enable ViT training.
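In other words, a plain LoRA run over a text-only dialogue dataset needs no ViT-related flag at all (a sketch; dataset.json here stands for your text-only dialogue data):

```shell
# LoRA fine-tuning on text-only dialogues; the ViT is not trained by default
CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --sft_type lora \
  --dataset dataset.json
```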

Jintao-Huang avatar Sep 18 '24 15:09 Jintao-Huang

Fine-tuning best practices for qwen2.5-72b-instruct and qwen2-vl-72b-instruct:

https://github.com/modelscope/ms-swift/issues/2064

Jintao-Huang avatar Sep 18 '24 15:09 Jintao-Huang

many thanks @Jintao-Huang

lh0x00 avatar Sep 18 '24 16:09 lh0x00

For swift3, please refer to: https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal

Jintao-Huang avatar Feb 05 '25 07:02 Jintao-Huang