
💡 [REQUEST] - Simultaneous multimodal inputs

Jiltseb opened this issue 8 months ago • 25 comments

Start Date

No response

Implementation PR

Can the model already generate outputs when audio and video are provided at the same time? I have tried it, and it always returns results for the visual prompt only, ignoring the audio part completely.

This would be useful for summarizing a video that contains spoken content, which is a very common case. Enabling it would let a single forward pass produce a combined audio and visual summary. Is the model capable of this, i.e. was it also trained for this setting?

Reference Issues

No response

Summary

Support simultaneous audio and video inputs in a single request; currently the model appears to ignore the audio whenever both modalities are provided.

Basic Example

For example, with vLLM this would look like the following:

    # One placeholder per audio/video item referenced in the prompt.
    audio_placeholder = "(<audio>./</audio>)" * 1
    video_placeholder = "(<video>./</video>)" * 1
    multimodal_prompt = "Use the transcription and the overall acoustic and visual information to write a concise summary of the input containing spoken content."
    msgs = [{'role': 'user', 'content': f'{audio_placeholder}{video_placeholder}\n{multimodal_prompt}'}]

    prompt = tokenizer.apply_chat_template(
        msgs,
        tokenize=False,
        add_generation_prompt=True
    )

    # video_part: decoded video frames; audio_part: 16 kHz waveform (see the setup sketch below).
    input_data = {
        "prompt": prompt,
        "multi_modal_data": {
            "video": video_part,
            "audio": (audio_part, 16000),
        },
    }
    res = llm.generate(input_data, sampling_params=sampling_params)
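
For context, here is a minimal sketch of the setup that the snippet above assumes (`llm`, `tokenizer`, `sampling_params`, `video_part`, `audio_part`). The checkpoint name, the use of decord/librosa for loading the media, and the engine arguments are illustrative assumptions, not a confirmed configuration:

    # Minimal setup sketch. Assumptions: the "openbmb/MiniCPM-o-2_6" checkpoint,
    # decord/librosa for media loading, and vLLM's limit_mm_per_prompt option.
    import numpy as np
    import librosa
    from decord import VideoReader, cpu
    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    MODEL = "openbmb/MiniCPM-o-2_6"  # assumed omni-capable checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
    llm = LLM(
        model=MODEL,
        trust_remote_code=True,
        limit_mm_per_prompt={"video": 1, "audio": 1},  # one video + one audio per prompt
    )
    sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

    # Video: sample a handful of frames as a (num_frames, H, W, 3) uint8 array.
    vr = VideoReader("input.mp4", ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, num=16, dtype=int)
    video_part = vr.get_batch(frame_idx).asnumpy()

    # Audio: mono waveform resampled to 16 kHz, matching the (array, 16000) tuple above.
    # Requires an audio backend (e.g. ffmpeg) that can read the container.
    audio_part, _ = librosa.load("input.mp4", sr=16000, mono=True)

With that in place, the `llm.generate(input_data, sampling_params=sampling_params)` call above should run a single forward pass over both modalities, which is the behaviour this request asks to be confirmed or enabled.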

Drawbacks

I do not see any drawbacks to the proposed feature :)

Unresolved questions

No response

Jiltseb • Feb 20 '25 13:02