does llamafactory support audio-video-text data instruction tuning?
https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/llamafactory_train_and_infer.md Does this llamafactory finetune script for MiniCPM-o support audio-video-text paired data like this? Thanks a lot!
"messages": [
{
"content": "<video><audio>Based on the video and audio, why is this video funny?",
"role": "user"
},
{
"content": "Because a baby is reading, and he is so cute!",
"role": "assistant"
}
],
"videos": [
"mllm_demo_data/1.mp4"
],
"audios": [
"mllm_demo_data/1.mp3"
]
},```
@ZMXJJ Could you help solve this issue?
@tc-mb Let me confirm it.
@ZMXJJ @tc-mb Similar question.
Does this llamafactory finetune script for MiniCPM-V-4_5 support multimodal training, such as having images and videos in one message and audio in another message?
[ { "messages": [ { "content": "<audio>What's that sound?", "role": "user" }, { "content": "It is the sound of glass shattering.", "role": "assistant" } ], "videos": [ "mllm_demo_data/1.mp4" ], "audios": [ "mllm_demo_data/1.mp3" ] }, { "messages": [ { "content": "<audio>What can you hear?", "role": "user" }, { "content": "A woman is coughing.", "role": "assistant" } ], "images": [ "mllm_demo_data/2.jpg" ] }, { "messages": [ { "content": "<audio>What does the person say?", "role": "user" }, { "content": "Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.", "role": "assistant" } ], "image": [ "mllm_demo_data/3.jpg" ] "video": [ "mllm_demo_data/3.mp4" ] } ]
@kada0720 @jessie-chen99 No problem, you can just follow this LLaMA-Factory file to build the relevant data. But keep in mind — if you’re using MiniCPM-V 4.5, you won’t be able to fine-tune with audio.