MiniCPM-V does llamafactory support audio-video-text data instruction tuning?

https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/llamafactory_train_and_infer.md Does this llamafactory finetune script for MiniCPM-o support audio-video-text paired data like this? Thanks a lot!

    "messages": [
      {
        "content": "<video><audio>Based on the video and audio, why is this video funny?",
        "role": "user"
      },
      {
        "content": "Because a baby is reading, and he is so cute!",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/1.mp4"
    ],
   "audios": [
      "mllm_demo_data/1.mp3"
    ]
  },```

Jun 23 '25 09:06 jessie-chen99

@ZMXJJ Could you help solve this issue?

Aug 17 '25 13:08 tc-mb

@tc-mb Let me confirm it.

Aug 17 '25 13:08 ZMXJJ

@ZMXJJ @tc-mb Similar question. Does this llamafactory finetune script for MiniCPM-V-4_5 support multimodal training, such as having images and videos in one message and audio in another message? [ { "messages": [ { "content": "<audio>What's that sound?", "role": "user" }, { "content": "It is the sound of glass shattering.", "role": "assistant" } ], "videos": [ "mllm_demo_data/1.mp4" ], "audios": [ "mllm_demo_data/1.mp3" ] }, { "messages": [ { "content": "<audio>What can you hear?", "role": "user" }, { "content": "A woman is coughing.", "role": "assistant" } ], "images": [ "mllm_demo_data/2.jpg" ] }, { "messages": [ { "content": "<audio>What does the person say?", "role": "user" }, { "content": "Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.", "role": "assistant" } ], "image": [ "mllm_demo_data/3.jpg" ] "video": [ "mllm_demo_data/3.mp4" ] } ]

Sep 09 '25 07:09 kada0720

@kada0720 @jessie-chen99 No problem, you can just follow this LLaMA-Factory file to build the relevant data. But keep in mind — if you’re using MiniCPM-V 4.5, you won’t be able to fine-tune with audio.

Sep 09 '25 15:09 ZMXJJ