
Question about results on Egoschema

Open pPetrichor opened this issue 1 year ago • 4 comments

Hi, thanks for your great work! I have read your MovieChat+ paper and noticed that the Zero-shot QA Evaluation result of MovieChat on EgoSchema is 53.5, while the evaluation result in this CVPR paper(Koala: Key frame-conditioned long video-LLM https://arxiv.org/pdf/2404.04346) is much lower. I guess the possible reason is that the LLM used and the way to evaluate are different, so I would like to confirm what LLM you used for the EgoSchema result(Koala used llama2) and the specific implementation of the LangChain evaluation. Thank you very much!

pPetrichor avatar Apr 30 '24 13:04 pPetrichor

For a fair comparison, we use Llama.

EgoSchema is a multiple-choice VQA dataset, and it has been shown that when the model is given the choices, their order can affect the answer. We find that with the question only (we do not use any other prompt), the answer is more relevant to the question and leads to a higher score. Once we get the answer produced by MovieChat, we use LangChain to compute its similarity to each of the choices and select the most similar one as our prediction.
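A minimal sketch of this matching step. The bag-of-words embedding and cosine similarity below are stand-ins for illustration; the actual evaluation uses LangChain with a learned embedding model:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in bag-of-words embedding; a real evaluation would use a
    # learned embedding model (e.g. through LangChain).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_choice(model_answer, choices):
    # Select the multiple-choice option most similar to the model's
    # free-form answer.
    ans_vec = embed(model_answer)
    return max(choices, key=lambda c: cosine(ans_vec, embed(c)))

answer = "the person is washing dishes at the sink"
choices = [
    "the person cooks a meal",
    "the person washes dishes",
    "the person repairs a bike",
]
print(pick_choice(answer, choices))  # → the person washes dishes
```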

Espere-1119-Song avatar Apr 30 '24 13:04 Espere-1119-Song

Thanks for your kind response. Would you please provide the inference code that "asks LangChain to calculate the similarity with the multiple choices", so we can better align our evaluation with yours? Thanks a lot!

pPetrichor avatar May 18 '24 10:05 pPetrichor

Unfortunately, we can't provide you with the code directly. For the evaluation code with LangChain, you can refer to https://python.langchain.com.cn/docs/modules/model_io/prompts/example_selectors/similarity; we just take the answers as the "Input". Hope this is helpful to you!

Espere-1119-Song avatar May 18 '24 13:05 Espere-1119-Song

Hi @Espere-1119-Song, the OpenAI key is no longer usable for me, so I replaced 'OpenAIEmbeddings()' with Ollama. Could you tell me which embedding model you used in your evaluation code?

msra-jqxu avatar Aug 06 '24 11:08 msra-jqxu