💡 [REQUEST] - Simultaneous multimodal inputs
Start Date
No response
Implementation PR
No response
Reference Issues
No response
Summary
Can the model already generate output when audio and video are provided at the same time? I have tried it, and it always returns a result based on the visual input alone, ignoring the audio part completely. Simultaneous inputs would be useful for summarizing a video that contains spoken content, which is a very common case: a single forward pass would then be enough to obtain a combined audio-visual summary. Is the model capable of this, or trained for it?
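A minimal sketch of the kind of call I mean, using the Hugging Face transformers chat interface (the model id, loading flags, frame/audio decoding, and chat parameters here are illustrative assumptions based on the model card, not a verified reproduction):

import torch
import librosa
from PIL import Image
from decord import VideoReader, cpu
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; loading flags follow the model card.
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa',
                                  torch_dtype=torch.bfloat16).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Decode both modalities from the same clip (assumes an ffmpeg-backed decoder for the audio track).
vr = VideoReader('input.mp4', ctx=cpu(0))
idx = list(range(0, len(vr), max(1, len(vr) // 16)))  # sample ~16 frames
frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]
audio, _ = librosa.load('input.mp4', sr=16000, mono=True)

# Mixed content list: frames + audio waveform + instruction in a single user turn.
msgs = [{'role': 'user', 'content': frames + [audio, 'Summarize the spoken and visual content.']}]
res = model.chat(msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
# In my tests the answer describes only the visuals; the audio appears to be ignored.
print(res)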
Basic Example
For example, with vLLM this could look like the following. The placeholder strings and the multi_modal_data layout are from my original snippet; the model id, loading flags, and input decoding are illustrative assumptions:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from decord import VideoReader, cpu
import librosa

# Illustrative setup; the model id and loading flags are assumptions.
model_name = "openbmb/MiniCPM-o-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True, limit_mm_per_prompt={"video": 1, "audio": 1})
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

# One placeholder per modality instance in the prompt.
audio_placeholder = "(<audio>./</audio>)"
video_placeholder = "(<video>./</video>)"
multimodal_prompt = "Use transcription and overall acoustic and visual information to write a concise summary of the input containing spoken content."
msgs = [{'role': 'user', 'content': f'{audio_placeholder}{video_placeholder}\n{multimodal_prompt}'}]
prompt = tokenizer.apply_chat_template(
    msgs,
    tokenize=False,
    add_generation_prompt=True
)

# Assumed decoding: ~16 sampled frames for the video part, a 16 kHz waveform for the audio part.
vr = VideoReader("input.mp4", ctx=cpu(0))
video_part = vr.get_batch(list(range(0, len(vr), max(1, len(vr) // 16)))).asnumpy()
audio_part, _ = librosa.load("input.mp4", sr=16000, mono=True)
input_data = {
    "prompt": prompt,
    "multi_modal_data": {
        "video": video_part,
        "audio": (audio_part, 16000),
    },
}
res = llm.generate(input_data, sampling_params=sampling_params)
print(res[0].outputs[0].text)
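Because both placeholders sit in one prompt, a single llm.generate call (one forward pass) would yield a summary conditioned on both streams, instead of running separate audio-only and video-only passes and merging their outputs afterwards.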
Drawbacks
I cannot see any drawbacks of the proposed feature :)
Unresolved questions
No response