Question about baseline OpenAI CLIP zero-shot retrieval results on Flickr30K
Hi, thanks for sharing this impressive work! 🎉
I really appreciate your work on extending CLIP with Mixture-of-Experts; the idea of scaling representational capacity efficiently is very insightful, and your experimental results look quite promising.
However, I noticed that the zero-shot retrieval performance of the baseline OpenAI CLIP on Flickr30K reported in Table 2 of your paper differs significantly from the results reported in the original CLIP paper. Here are the two tables for comparison:
For example, the I2T Recall@1 / Recall@5 / Recall@10 values seem notably different. Could you please clarify whether:

- you re-evaluated CLIP under a different setup (e.g., different preprocessing, data splits, or evaluation code), or
- these numbers are taken from another reference implementation?
It would be great if you could share some details on how the baseline CLIP results were obtained (checkpoint, split, and evaluation protocol, e.g. along the lines of the sketch below); that would really help readers reproduce the numbers and make fair comparisons in future work.
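For context, here is roughly what I ran when trying to reproduce the baseline myself. This is only a minimal sketch, not a claim about your setup: it assumes the official openai/CLIP ViT-B/32 checkpoint, the standard Karpathy 1K test split of Flickr30K (1,000 images, 5 captions each), and that `image_paths` / `captions` are placeholder lists you would fill in yourself. If your evaluation differs from this in any respect, that could well explain the gap.

```python
import torch
import clip  # official package from https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Hypothetical loaders -- replace with your actual Karpathy-split data.
# Assumed layout: captions[5*i : 5*i + 5] describe image_paths[i].
image_paths: list[str] = [...]   # 1,000 test image paths
captions: list[str] = [...]      # 5,000 captions, 5 per image, in image order

@torch.no_grad()
def encode_all(image_paths, captions, batch_size=256):
    """Encode and L2-normalize all images and captions."""
    img_feats, txt_feats = [], []
    for i in range(0, len(image_paths), batch_size):
        imgs = torch.stack([preprocess(Image.open(p).convert("RGB"))
                            for p in image_paths[i:i + batch_size]]).to(device)
        img_feats.append(model.encode_image(imgs).float())
    for i in range(0, len(captions), batch_size):
        toks = clip.tokenize(captions[i:i + batch_size], truncate=True).to(device)
        txt_feats.append(model.encode_text(toks).float())
    img, txt = torch.cat(img_feats), torch.cat(txt_feats)
    return (img / img.norm(dim=-1, keepdim=True),
            txt / txt.norm(dim=-1, keepdim=True))

img, txt = encode_all(image_paths, captions)
sim_i2t = img @ txt.T  # cosine similarity, shape [1000, 5000]

# I2T: a hit if any of the 5 ground-truth captions appears in the top-k.
ranks = sim_i2t.argsort(dim=-1, descending=True)
gt_img = torch.arange(len(image_paths), device=device)[:, None]
for k in (1, 5, 10):
    hits = (torch.div(ranks[:, :k], 5, rounding_mode="floor") == gt_img).any(-1)
    print(f"I2T R@{k}: {100 * hits.float().mean().item():.1f}")

# T2I: the single ground-truth image for caption j is image j // 5.
ranks = sim_i2t.T.argsort(dim=-1, descending=True)
gt_txt = (torch.arange(len(captions), device=device) // 5)[:, None]
for k in (1, 5, 10):
    hits = (ranks[:, :k] == gt_txt).any(-1)
    print(f"T2I R@{k}: {100 * hits.float().mean().item():.1f}")
```

In particular, I'd be curious whether you used this kind of global image-to-all-captions similarity matrix, or a different protocol (e.g., prompt templates around the captions, a different image resolution, or a different test split).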
Thanks again for your excellent contribution and for making the code and models available to the community!