Question about baseline OpenAI CLIP zero-shot retrieval results on Flickr30K
Hi, thanks for sharing this impressive work! 🎉
I really appreciate your work on extending CLIP with Mixture-of-Experts; the idea of scaling representational capacity efficiently is very insightful, and your experimental results look quite promising.
However, I noticed that the zero-shot retrieval performance of the baseline OpenAI CLIP on Flickr30K reported in Table 2 of your paper differs significantly from the results reported in the original CLIP paper. Here are the two tables for comparison:
For example, the I2T Recall@1 / Recall@5 / Recall@10 values seem notably different. Could you please clarify whether:

- you re-evaluated CLIP under a different setup (e.g., different preprocessing, data splits, or evaluation code), or
- these numbers are taken from another reference implementation?
It would be great if you could share some details on how the baseline CLIP results were obtained (checkpoint, split, and evaluation protocol, e.g. along the lines of the sketch below); that would really help readers reproduce the numbers and make fair comparisons in future work.
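For context, here is roughly what I ran when trying to reproduce the baseline myself. This is only a minimal sketch, not a claim about your setup: it assumes the official openai/CLIP ViT-B/32 checkpoint, the standard Karpathy 1K test split of Flickr30K (1,000 images, 5 captions each), and that `image_paths` / `captions` are placeholder lists you would fill in yourself. If your evaluation differs from this in any respect, that could well explain the gap.

```python
import torch
import clip  # official package from https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Hypothetical loaders -- replace with your actual Karpathy-split data.
# Assumed layout: captions[5*i : 5*i + 5] describe image_paths[i].
image_paths: list[str] = [...]   # 1,000 test image paths
captions: list[str] = [...]      # 5,000 captions, 5 per image, in image order

@torch.no_grad()
def encode_all(image_paths, captions, batch_size=256):
    """Encode and L2-normalize all images and captions."""
    img_feats, txt_feats = [], []
    for i in range(0, len(image_paths), batch_size):
        imgs = torch.stack([preprocess(Image.open(p).convert("RGB"))
                            for p in image_paths[i:i + batch_size]]).to(device)
        img_feats.append(model.encode_image(imgs).float())
    for i in range(0, len(captions), batch_size):
        toks = clip.tokenize(captions[i:i + batch_size], truncate=True).to(device)
        txt_feats.append(model.encode_text(toks).float())
    img, txt = torch.cat(img_feats), torch.cat(txt_feats)
    return (img / img.norm(dim=-1, keepdim=True),
            txt / txt.norm(dim=-1, keepdim=True))

img, txt = encode_all(image_paths, captions)
sim_i2t = img @ txt.T  # cosine similarity, shape [1000, 5000]

# I2T: a hit if any of the 5 ground-truth captions appears in the top-k.
ranks = sim_i2t.argsort(dim=-1, descending=True)
gt_img = torch.arange(len(image_paths), device=device)[:, None]
for k in (1, 5, 10):
    hits = (torch.div(ranks[:, :k], 5, rounding_mode="floor") == gt_img).any(-1)
    print(f"I2T R@{k}: {100 * hits.float().mean().item():.1f}")

# T2I: the single ground-truth image for caption j is image j // 5.
ranks = sim_i2t.T.argsort(dim=-1, descending=True)
gt_txt = (torch.arange(len(captions), device=device) // 5)[:, None]
for k in (1, 5, 10):
    hits = (ranks[:, :k] == gt_txt).any(-1)
    print(f"T2I R@{k}: {100 * hits.float().mean().item():.1f}")
```

In particular, I'd be curious whether you used this kind of global image-to-all-captions similarity matrix, or a different protocol (e.g., prompt templates around the captions, a different image resolution, or a different test split).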
Thanks again for your excellent contribution and for making the code and models available to the community!