What are the evaluation plans for each modality?
- Image Understanding: the official VLMEvalKit library ✅
- Multi-image and Video Understanding ❓
- Audio Understanding ❓
- Speech Generation ❓
- End-to-end Voice Cloning ❓
- Multimodal Live Streaming ❓
Hi @bobo0810,
Thanks for your question! I will provide some details specifically for Multimodal Live Streaming, as you requested.
For evaluating MiniCPM-o 2.6's capabilities in Multimodal Live Streaming, you can refer to the following repository: https://github.com/THUNLP-MT/StreamingBench. This repository provides a comprehensive benchmark and evaluation framework for streaming MLLMs.
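To give a concrete sense of the "streaming" constraint this benchmark imposes (questions are posed at timestamps during the video, and the model should only see content up to that point), here is a minimal, hypothetical frame-sampling helper. The helper name, sampling strategy, and frame count are my own illustration, not StreamingBench's actual code:

```python
# Illustrative sketch only -- not code from StreamingBench. It assumes the
# benchmark asks each question at a timestamp and that the model may only
# see frames recorded before that timestamp.
from decord import VideoReader
from PIL import Image

def sample_frames_until(video_path: str, question_time_s: float, num_frames: int = 16):
    """Uniformly sample up to `num_frames` frames from the start of the video
    up to the moment the question is asked."""
    vr = VideoReader(video_path)
    last = min(int(question_time_s * vr.get_avg_fps()), len(vr) - 1)
    step = max(last // max(num_frames - 1, 1), 1)
    indices = list(range(0, last + 1, step))[:num_frames]
    return [Image.fromarray(vr[i].asnumpy()) for i in indices]
```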
Here's a step-by-step guide to reproduce the results for MiniCPM-o 2.6 on StreamingBench:

- Inference Code: The inference code for MiniCPM-o 2.6 is located within the src/model directory of the StreamingBench repository (a hedged sketch of the core inference call is included at the end of this reply).
- Evaluation Pipeline: Follow the "Evaluation Pipeline" instructions in the StreamingBench repository's README. This involves three main stages:
  - Data Preparation
  - Model Preparation
  - Evaluation
By following these steps and the instructions within the StreamingBench repository, you should be able to fully reproduce the evaluation results for MiniCPM-o 2.6 on Multimodal Live Streaming. Let me know if you have any further questions!
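For reference, here is a rough sketch of the kind of inference call the src/model adapter performs, using the Hugging Face checkpoint openbmb/MiniCPM-o-2_6. The checkpoint name, the message format, and especially the `.chat()` signature are assumptions on my part (the exact interface is defined by the model's remote code), so please verify them against the official model card and the adapter in src/model before relying on this:

```python
# Hedged sketch of the core inference step; verify the chat() interface
# against the official MiniCPM-o 2.6 model card before relying on it.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "openbmb/MiniCPM-o-2_6"  # assumed checkpoint name

model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

def answer_question(frames, question: str) -> str:
    """Ask one question about a list of PIL frames sampled from the stream."""
    # Assumed message format: images passed inline in the content list.
    msgs = [{"role": "user", "content": frames + [question]}]
    return model.chat(msgs=msgs, tokenizer=tokenizer)

def multiple_choice_accuracy(predictions, answers) -> float:
    """Toy scorer: a prediction counts as correct if it starts with the gold
    option letter. Use the benchmark's own evaluation scripts for reported numbers."""
    hits = sum(p.strip().upper().startswith(a.strip().upper())
               for p, a in zip(predictions, answers))
    return hits / max(len(answers), 1)
```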
@mjuicem Thank you very much. May I ask how to reproduce the metrics for multi-image, video, and audio understanding?
@lihytotoro Regarding evaluation for multi-image, video, and audio:
- Audio understanding: https://github.com/OpenBMB/UltraEval-Audio
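As a small, self-contained illustration of the kind of metric used for audio understanding (this is not UltraEval-Audio's API; follow that repository's README for the actual pipeline), word error rate for ASR-style tasks can be computed with jiwer:

```python
# Standalone illustration of an ASR metric (word error rate); it does not
# call UltraEval-Audio, whose README documents the actual evaluation flow.
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over a lazy dog"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.3f}")
```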