Metrics
I noticed that the E5-V result on COCO retrieval in the main text is 52 / 62, while Appendix C reports the remaining MLLM-based results, as shown in the figure below. The COCO numbers look inconsistent to me.
On one hand, why is there such a large difference from LLaVA-NeXT-8B? On the other hand, these two numbers don't seem to align with the R@1 metrics for COCO I2T or T2I. Could you clarify the source of the metrics in this table, or provide the corresponding metrics for Phi-3V?
The results in this figure are from the setting without fine-tuning, i.e., we only use prompts to obtain the embeddings. The E5-V results come from fine-tuning on text pairs and are reported in the red box in the following table. The COCO results are 76.5 / 83.6, which match the T2I/I2T R@5 metrics in Table 1.
Thanks, but could you provide the corresponding metrics (T2I/I2T R@1) on COCO and Flickr30K for Phi-3V in the fine-tuning setting?
Sorry, but I only kept the R@5 results for Phi-3V in the fine-tuning setting.
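For anyone who still needs R@1, it can be recomputed from the embeddings with a standard Recall@K evaluation. Below is a minimal sketch (not taken from the E5-V code; all variable names are illustrative), assuming L2-normalized image and caption embeddings and the usual 5-captions-per-image setup of COCO/Flickr30K:

```python
import torch

def recall_at_k(sim: torch.Tensor, gt: list[set[int]], k: int) -> float:
    """Fraction of queries whose top-k retrieved candidates contain a ground truth.

    sim: (num_queries, num_candidates) cosine-similarity matrix.
    gt:  one set of correct candidate indices per query (for I2T each image
         has 5 correct captions; for T2I each caption has 1 correct image).
    """
    topk = sim.topk(k, dim=1).indices.tolist()
    hits = sum(1 for row, correct in zip(topk, gt) if correct & set(row))
    return 100.0 * hits / len(gt)

# Usage sketch (illustrative names, not from the repo):
# img_emb: (N, d) image embeddings, txt_emb: (5N, d) caption embeddings,
# both L2-normalized, with captions 5i..5i+4 belonging to image i.
# N = img_emb.size(0)
# t2i_r1 = recall_at_k(txt_emb @ img_emb.T, [{i // 5} for i in range(5 * N)], k=1)
# i2t_r1 = recall_at_k(img_emb @ txt_emb.T,
#                      [set(range(5 * i, 5 * i + 5)) for i in range(N)], k=1)
```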
Thanks