What are the evaluation plans for each modality?
- Image Understanding: the official VLMEvalKit library ✅
- Multi-image and Video Understanding ❓
- Audio Understanding ❓
- Speech Generation ❓
- End-to-end Voice Cloning ❓
- Multimodal Live Streaming ❓
Hi @bobo0810,
Thanks for your question! I will provide some details specifically for Multimodal Live Streaming, as you requested.
For evaluating MiniCPM-o 2.6's capabilities in Multimodal Live Streaming, you can refer to the following repository: https://github.com/THUNLP-MT/StreamingBench. This repository provides a comprehensive benchmark and evaluation framework for streaming MLLMs.
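To give a concrete sense of the "streaming" constraint this benchmark imposes (questions are posed at timestamps during the video, and the model should only see content up to that point), here is a minimal, hypothetical frame-sampling helper. The helper name, sampling strategy, and frame count are my own illustration, not StreamingBench's actual code:

```python
# Illustrative sketch only -- not code from StreamingBench. It assumes the
# benchmark asks each question at a timestamp and that the model may only
# see frames recorded before that timestamp.
from decord import VideoReader
from PIL import Image

def sample_frames_until(video_path: str, question_time_s: float, num_frames: int = 16):
    """Uniformly sample up to `num_frames` frames from the start of the video
    up to the moment the question is asked."""
    vr = VideoReader(video_path)
    last = min(int(question_time_s * vr.get_avg_fps()), len(vr) - 1)
    step = max(last // max(num_frames - 1, 1), 1)
    indices = list(range(0, last + 1, step))[:num_frames]
    return [Image.fromarray(vr[i].asnumpy()) for i in indices]
```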
Here's a step-by-step guide to reproduce the results for MiniCPM-o 2.6 on StreamingBench:

- Inference Code: The inference code for MiniCPM-o 2.6 is located within the src/model directory of the StreamingBench repository (a hedged sketch of the core inference call is included at the end of this reply).
- Evaluation Pipeline: Follow the "Evaluation Pipeline" instructions in the StreamingBench repository's README. This involves three main stages:
  - Data Preparation
  - Model Preparation
  - Evaluation
By following these steps and the instructions within the StreamingBench repository, you should be able to fully reproduce the evaluation results for MiniCPM-o 2.6 on Multimodal Live Streaming. Let me know if you have any further questions!
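For reference, here is a rough sketch of the kind of inference call the src/model adapter performs, using the Hugging Face checkpoint openbmb/MiniCPM-o-2_6. The checkpoint name, the message format, and especially the `.chat()` signature are assumptions on my part (the exact interface is defined by the model's remote code), so please verify them against the official model card and the adapter in src/model before relying on this:

```python
# Hedged sketch of the core inference step; verify the chat() interface
# against the official MiniCPM-o 2.6 model card before relying on it.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "openbmb/MiniCPM-o-2_6"  # assumed checkpoint name

model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

def answer_question(frames, question: str) -> str:
    """Ask one question about a list of PIL frames sampled from the stream."""
    # Assumed message format: images passed inline in the content list.
    msgs = [{"role": "user", "content": frames + [question]}]
    return model.chat(msgs=msgs, tokenizer=tokenizer)

def multiple_choice_accuracy(predictions, answers) -> float:
    """Toy scorer: a prediction counts as correct if it starts with the gold
    option letter. Use the benchmark's own evaluation scripts for reported numbers."""
    hits = sum(p.strip().upper().startswith(a.strip().upper())
               for p, a in zip(predictions, answers))
    return hits / max(len(answers), 1)
```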
@mjuicem Thank you very much. May I ask how to reproduce the metrics for multi-image, video, and audio understanding?
@lihytotoro Regarding evaluation for multi-image, video, and audio:
- Audio understanding: https://github.com/OpenBMB/UltraEval-Audio
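As a small, self-contained illustration of the kind of metric used for audio understanding (this is not UltraEval-Audio's API; follow that repository's README for the actual pipeline), word error rate for ASR-style tasks can be computed with jiwer:

```python
# Standalone illustration of an ASR metric (word error rate); it does not
# call UltraEval-Audio, whose README documents the actual evaluation flow.
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over a lazy dog"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.3f}")
```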