
Question about baseline OpenAI CLIP zero-shot retrieval results on Flickr30K

Open LZY-233 opened this issue 2 months ago • 0 comments

Hi, thanks for sharing this impressive work! 🎉

I really appreciate your efforts on extending CLIP with Mixture-of-Experts — the idea of scaling representational capacity efficiently is very insightful, and your experimental results look quite promising.

However, I noticed that in Table 2 of your paper, the reported zero-shot retrieval performance of the baseline OpenAI CLIP on the Flickr30K dataset differs significantly from the results reported in the original CLIP paper. Here are the two tables:

[Images: screenshots of the two tables being compared]

For example, the I2T Recall@1 / Recall@5 / Recall@10 values seem notably different. Could you please clarify whether:

  • You re-evaluated CLIP under a different setup (e.g., different preprocessing, data splits, or evaluation code)?

  • Or are these numbers taken from another reference implementation?
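
For reference, here is a minimal sketch of how I understand I2T Recall@K is usually computed: an image query counts as a hit if any of its ground-truth captions appears among the top K retrieved captions. The function name `recall_at_k`, the input layout, and the toy similarity values are my own assumptions for illustration. They are not taken from your code or from the CLIP paper.

```python
def recall_at_k(similarity, gt, k):
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[i][j]: score between query i and candidate j (higher is better).
    gt[i]: set of candidate indices that are correct for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score and keep the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if any(j in gt[i] for j in top_k):
            hits += 1
    return hits / len(similarity)


# Toy example (made-up scores): 2 images, 4 captions, 2 captions per image.
sim = [
    [0.9, 0.1, 0.8, 0.2],  # image 0 vs captions 0..3
    [0.7, 0.3, 0.4, 0.6],  # image 1 vs captions 0..3
]
gt = [{0, 1}, {2, 3}]  # captions 0,1 belong to image 0; captions 2,3 to image 1

r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
print(f"I2T R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

If your evaluation differs from this (for example, a different test split, a different hit criterion, or different image preprocessing), that alone might explain the gap.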

It would be great if you could provide some details on how the baseline CLIP results were obtained — this will really help readers reproduce and fairly compare future works.

Thanks again for your excellent contribution and for making the code and models available to the community!

LZY-233 · Oct 12 '25 12:10