vision-language-models-are-bows
Questions on evaluation results
Dear Authors,
Firstly, thank you for your outstanding project. I've been experimenting with your benchmark using the provided codebase and noticed some discrepancies between the evaluation results I obtained and those reported in your paper.
Below is a comparative overview of my evaluation results on the VG datasets and the PRC tasks using COCO_order and Flickr30k_order (a sketch of the evaluation loop I used follows the table):
model | pretrained | vg_relation | vg_attribution | coco_order | flickr30k_order | Task Avg. |
---|---|---|---|---|---|---|
ViT-B-32 | openai | 59.9% | 63.2% | 47.4% | 58.8% | 57.3% |
NegCLIP | coco ft | 80.2% | 70.5% | 86.8% | 89.7% | 81.8% |
BLIP-base | flickr ft | 49.7% | 89.9% | 42.5% | 40.5% | 55.7% |
BLIP-base | coco ft | 58.4% | 89.5% | 37.1% | 46.3% | 57.8% |
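For reference, here is a minimal sketch of the evaluation loop behind the ViT-B-32 / openai row. It uses open_clip directly rather than your evaluation scripts, and `load_order_annotations` in the usage comment is a hypothetical placeholder for the repo's annotation loading, so the details may differ from your pipeline:

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def score_order_task(samples):
    """samples: list of (image_path, captions), where captions[0] is the
    original caption and the rest are word-order perturbations of it."""
    correct = 0
    with torch.no_grad():
        for image_path, captions in samples:
            image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
            text = tokenizer(captions).to(device)
            image_feat = model.encode_image(image)
            text_feat = model.encode_text(text)
            # Cosine similarity between the image and each candidate caption
            image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
            sims = (image_feat @ text_feat.T).squeeze(0)
            correct += int(sims.argmax().item() == 0)  # correct iff the original caption wins
    return correct / len(samples)

# samples = load_order_annotations("coco_order")  # hypothetical loader, stands in for the repo's dataset code
# print(f"COCO_order accuracy: {score_order_task(samples):.1%}")
```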
My reproduced results for VG_Relation and VG_Attribution closely align with the numbers presented in your paper. However, I have concerns about the NegCLIP result on flickr30k_order, where I obtained 89.7% while your paper reports 91% (0.91) in Appendix Table 6.
In addition, the discrepancy for the BLIP models appears somewhat larger. Appendix Table 5 of your paper reports 0.369 on Flickr30k-PRC for BLIP-flickr-base and 0.321 on COCO-PRC for BLIP-coco-base, whereas I obtained noticeably higher scores of 40.5% and 37.1%, respectively, with the same models.
Note 1: I observed that some randomness arises when the order annotations are created from the original annotation file. However, this randomness does not seem large enough to explain the gap observed.
Note 2: To control for this randomness, I used the same order annotations across all models in my experiments.
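Concretely, I generate the perturbed captions once with a fixed seed and reuse the saved file for every model. The routine below is only a simplified stand-in for the perturbation logic in your data-generation code, meant to illustrate how I keep the annotations fixed:

```python
import json
import random

def make_order_annotations(captions, n_perturbations=4, seed=0):
    """Build (original + shuffled) caption sets with a fixed seed.
    Simplified stand-in for the released perturbation logic, which uses
    more structured word-order perturbations."""
    rng = random.Random(seed)
    annotations = []
    for caption in captions:
        words = caption.split()
        perturbed = []
        for _ in range(n_perturbations):
            shuffled = words[:]
            rng.shuffle(shuffled)
            perturbed.append(" ".join(shuffled))
        # Index 0 is always the original caption, matching the scoring sketch above.
        annotations.append([caption] + perturbed)
    return annotations

# Generated once, then the same file is loaded for every model I evaluate:
# with open("coco_order_annotations.json", "w") as f:
#     json.dump(make_order_annotations(coco_captions), f)
```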
Given that these results were obtained with the provided code and checkpoints, I would appreciate any pointers on mistakes I may have made or on what else could explain the difference.
Best regards,