Question about whether Yi-VL-6B can be fine-tuned on one's own dataset

Open a2382625920 opened this issue 1 year ago • 15 comments

Reminder

  • [X] I have searched the GitHub Discussions and Issues and have not found anything similar to this.

Motivation

Yi-VL's low video-memory footprint and fast inference leave room for a lot of utility. If the Yi-VL series of multimodal large models could be fine-tuned on one's own dataset, it would be a great leap forward for many projects!

Solution

No response

Alternatives

No response

Anything Else?

No response

Are you willing to submit a PR?

  • [x] I'm willing to submit a PR!

a2382625920 avatar Jan 24 '24 07:01 a2382625920

hello~

yi-vl-6b is an excellent model, and the ms-swift LLM training framework has incorporated SFT for yi-vl. It provides example scripts and supports custom datasets. You can check it out here~ https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/yi_vl_6b_chat

The script utilizes the COCO dataset for fine-tuning. After training, the generated samples are as follows:

"""
[PROMPT]This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

### Human: [-200 * 1]
please describe the image.
### Assistant:
[OUTPUT]A large airplane is on display in a museum. 
###

[LABELS]People walking in a museum with a airplane hanging from the celing.
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000492132.jpg']
--------------------------------------------------------------------
[PROMPT]This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

### Human: [-200 * 1]
please describe the image.
### Assistant:
[OUTPUT]A bowl of fruit and cake next to a cup of coffee. 
###

[LABELS]a bowl of fruit and pastry on a table
[IMAGES]['https://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/coco/2014/val2014/COCO_val2014_000000558642.jpg']
"""

Jintao-Huang avatar Jan 27 '24 04:01 Jintao-Huang

I added the finetuning scripts. See #368

minlik avatar Jan 31 '24 07:01 minlik

hello~

yi-vl-6b is an excellent model, and the ms-swift LLM training framework has incorporated SFT for yi-vl. It provides example scripts and supports custom datasets. You can check it out here~ https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/yi_vl_6b_chat

Thanks, I'm trying to register the dataset in swift and train with the model I've already downloaded!

a2382625920 avatar Jan 31 '24 07:01 a2382625920

I added the finetuning scripts. See #368

Thank you very much for your work; I will try to use it!

a2382625920 avatar Jan 31 '24 07:01 a2382625920

I added the finetuning scripts. See #368

Hi, I tried the method you provided and it produces the following warnings, which may affect the final fine-tuned result:

WARNING: tokenization mismatch: 208 vs. 210. (ignored)
WARNING: tokenization mismatch: 220 vs. 222. (ignored)
WARNING: tokenization mismatch: 174 vs. 176. (ignored)
WARNING: tokenization mismatch: 219 vs. 221. (ignored)
WARNING: tokenization mismatch: 197 vs. 199. (ignored)
WARNING: tokenization mismatch: 216 vs. 218. (ignored)
WARNING: tokenization mismatch: 196 vs. 198. (ignored)
WARNING: tokenization mismatch: 222 vs. 224. (ignored)
WARNING: tokenization mismatch: 180 vs. 182. (ignored)
WARNING: tokenization mismatch: 219 vs. 221. (ignored)
WARNING: tokenization mismatch: 233 vs. 235. (ignored)
WARNING: tokenization mismatch: 177 vs. 179. (ignored)
WARNING: tokenization mismatch: 195 vs. 197. (ignored)
WARNING: tokenization mismatch: 227 vs. 229. (ignored)
WARNING: tokenization mismatch: 226 vs. 228. (ignored)
WARNING: tokenization mismatch: 221 vs. 223. (ignored)
WARNING: tokenization mismatch: 178 vs. 180. (ignored)
WARNING: tokenization mismatch: 237 vs. 239. (ignored)
WARNING: tokenization mismatch: 178 vs. 180. (ignored)
WARNING: tokenization mismatch: 227 vs. 229. (ignored)
WARNING: tokenization mismatch: 175 vs. 177. (ignored)
WARNING: tokenization mismatch: 222 vs. 224. (ignored)
WARNING: tokenization mismatch: 215 vs. 217. (ignored)
WARNING: tokenization mismatch: 217 vs. 219. (ignored)
WARNING: tokenization mismatch: 220 vs. 222. (ignored)
WARNING: tokenization mismatch: 215 vs. 217. (ignored)
WARNING: tokenization mismatch: 227 vs. 229. (ignored)
WARNING: tokenization mismatch: 178 vs. 180. (ignored)
WARNING: tokenization mismatch: 235 vs. 237. (ignored)
WARNING: tokenization mismatch: 177 vs. 179. (ignored)
WARNING: tokenization mismatch: 221 vs. 223. (ignored)
WARNING: tokenization mismatch: 197 vs. 199. (ignored)
WARNING: tokenization mismatch: 220 vs. 222. (ignored)
WARNING: tokenization mismatch: 176 vs. 178. (ignored)
WARNING: tokenization mismatch: 228 vs. 230. (ignored)
WARNING: tokenization mismatch: 221 vs. 223. (ignored)
WARNING: tokenization mismatch: 175 vs. 177. (ignored)
WARNING: tokenization mismatch: 220 vs. 222. (ignored)
WARNING: tokenization mismatch: 177 vs. 179. (ignored)
WARNING: tokenization mismatch: 176 vs. 178. (ignored)
WARNING: tokenization mismatch: 176 vs. 178. (ignored)
WARNING: tokenization mismatch: 229 vs. 231. (ignored)
WARNING: tokenization mismatch: 221 vs. 223. (ignored)
WARNING: tokenization mismatch: 177 vs. 179. (ignored)
WARNING: tokenization mismatch: 220 vs. 222. (ignored)
WARNING: tokenization mismatch: 227 vs. 229. (ignored)
WARNING: tokenization mismatch: 178 vs. 180. (ignored)
WARNING: tokenization mismatch: 233 vs. 235. (ignored)
WARNING: tokenization mismatch: 176 vs. 178. (ignored)
WARNING: tokenization mismatch: 175 vs. 177. (ignored)
WARNING: tokenization mismatch: 230 vs. 232. (ignored)

a2382625920 avatar Feb 01 '24 05:02 a2382625920

hello~

yi-vl-6b is an excellent model, and the ms-swift LLM training framework has incorporated SFT for yi-vl. It provides example scripts and supports custom datasets. You can check it out here~ https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/yi_vl_6b_chat

I don't know how to register my own dataset. Also, it doesn't seem to accept my own local model path: after I enter the local model path, swift still downloads the model over the network. How can I solve this?

a2382625920 avatar Feb 01 '24 05:02 a2382625920

WARNING: tokenization mismatch

I didn't encounter the same issue. Could you please share your training scripts?

After reviewing the code, I noticed that the WARNING might be caused by the commented code here. Could you please check your local code?

minlik avatar Feb 01 '24 06:02 minlik

@a2382625920 I thought of another possibility. The training code is modified from LLaVA. If you have installed llava locally, you can uninstall it and try again.
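
If it helps, the check could look something like the following, assuming the conflicting package is simply named llava, as it is when LLaVA is installed from source:

pip uninstall -y llava   # remove the globally installed LLaVA package
pip show llava           # should now report "Package(s) not found: llava"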

minlik avatar Feb 01 '24 06:02 minlik

I thought of another possibility. The training code is modified from LLaVA. If you have installed llava locally, you can uninstall it and try again.

I did use llava's virtual environment to run the code, and after I uninstalled it and set up the environment from Yi's installation instructions, the following error was reported:

Traceback (most recent call last):
  File "/root/siton-glusterfs-eaxtsxdfs/hzt/projects/Yi/VL/llava/train/train_mem.py", line 6, in <module>
    from llava.train import llama_flash_attn_monkey_patch
ModuleNotFoundError: No module named 'llava'

Do you have to use the llava environment?

a2382625920 avatar Feb 01 '24 07:02 a2382625920

WARNING: tokenization mismatch

I didn't encounter the same issue. Could you please share your training scripts?

After reviewing the code, I noticed that the WARNING might be caused by the commented code here. Could you please check your local code?

#!/bin/bash

deepspeed --include localhost:0 --master_port 1234 llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --model_name_or_path /root/siton-glusterfs-eaxtsxdfs/hzt/model/Yi-VL-6B \
    --data_path /root/siton-glusterfs-eaxtsxdfs/hzt/data/LLaVa_data/filter_cap_1226.json \
    --image_folder /root/siton-glusterfs-eaxtsxdfs/xts/data/s_mix/image \
    --vision_tower /root/siton-glusterfs-eaxtsxdfs/hzt/model/Yi-VL-6B/vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-6B-448 \
    --output_dir ./checkpoint/Yi-VL-6B \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --report_to wandb

a2382625920 avatar Feb 01 '24 07:02 a2382625920

I thought of another possibility. The training code is modified from LLaVA. If you have installed llava locally, you can uninstall it and try again.

I did use llava's virtual environment to run the code, and after I uninstalled it and set up the environment from Yi's installation instructions, the following error was reported:

Traceback (most recent call last):
  File "/root/siton-glusterfs-eaxtsxdfs/hzt/projects/Yi/VL/llava/train/train_mem.py", line 6, in <module>
    from llava.train import llama_flash_attn_monkey_patch
ModuleNotFoundError: No module named 'llava'

Do you have to use the llava environment?

Could you please run the command export PYTHONPATH=$PWD:$PYTHONPATH under the VL folder and then try again? Thank you.
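
Concretely, something like the following should make Yi's bundled llava package importable again. The path is illustrative; adapt it to your local checkout.

cd /path/to/Yi/VL                    # illustrative path: the folder that contains the llava/ package
export PYTHONPATH=$PWD:$PYTHONPATH   # so `from llava.train import ...` resolves to the local package
deepspeed --include localhost:0 --master_port 1234 llava/train/train_mem.py ...   # then rerun your training command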

minlik avatar Feb 01 '24 09:02 minlik

Can you also share the pretraining script? Which components (the projector and the vision encoder) are tuned in stage 1 and stage 2? This is not the same as LLaVA.

lucasjinreal avatar Feb 27 '24 07:02 lucasjinreal

Hello! 😊

Now Swift is enhancing its multimodal capabilities through fine-tuning. It has already supported custom datasets and full parameter fine-tuning. For best practices, you can refer to this link: https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/yi-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md#%E5%BE%AE%E8%B0%83

If interested, you are welcome to use it ~

Jintao-Huang avatar Mar 14 '24 13:03 Jintao-Huang

Looking forward to the official fine-tuning script being provided!

Jintao-Huang avatar Mar 14 '24 13:03 Jintao-Huang

A sad story: they never came back to check on the progress of this issue.

Jintao-Huang avatar Mar 14 '24 13:03 Jintao-Huang

@Jintao-Huang Hi, How can I fine-tune Yi-VL on my dataset? any docs link?

Iven2132 avatar Mar 30 '24 16:03 Iven2132

@Jintao-Huang Hi, How can I fine-tune Yi-VL on my dataset? any docs link?

ms-swift offers fine-tuning of Yi-VL on custom datasets, with both LoRA and full-parameter options, following the best practices, haha~ 😊

https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/yi-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md#%E5%BE%AE%E8%B0%83

Jintao-Huang avatar Mar 30 '24 16:03 Jintao-Huang

@Jintao-Huang Is there any notebook I can use to do that? Also, is yi-vl-6b-chat better than neva?

Iven2132 avatar Mar 30 '24 16:03 Iven2132

@Jintao-Huang How can I fine-tune the model with my custom dataset, which is a JSON file? I saw the docs are using coco-mini-en-2.

Iven2132 avatar Mar 31 '24 08:03 Iven2132

Here ~

https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#-%E6%8E%A8%E8%8D%90%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0%E7%9A%84%E5%BD%A2%E5%BC%8F

    --custom_train_dataset_path xxx.json \
    --custom_val_dataset_path yyy.json \
[{"query": "55555", "response": "66666", "images": ["image_path"]},
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]},
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}]

@Jintao-Huang How can I fine-tune the model with my custom dataset which is a JSON file? I saw the docs is using coco-mini-en-2

Jintao-Huang avatar Mar 31 '24 15:03 Jintao-Huang

Here ~

https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#-%E6%8E%A8%E8%8D%90%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0%E7%9A%84%E5%BD%A2%E5%BC%8F

    --custom_train_dataset_path xxx.jsonl \
    --custom_val_dataset_path yyy.jsonl \
[{"query": "55555", "response": "66666", "images": ["image_path"]},
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]},
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path", "image_path2", "image_path3"]}]

@Jintao-Huang How can I fine-tune the model with my custom dataset, which is a JSON file? I saw the docs are using coco-mini-en-2.

I tried this but was getting errors. Do you have any notebook I can use? Are you on the 01-ai Discord server? I'd love to chat!

Iven2132 avatar Mar 31 '24 15:03 Iven2132