
Visualized BGE Evaluation

Open • zwhus opened this issue 1 year ago • 12 comments

Hi, I want to reproduce the results of Visualized BGE, but the zero-shot benchmarks, such as WebQA, are not clearly described. Could you provide the evaluation datasets and code for the zero-shot benchmarks? Thanks!

zwhus avatar Jul 06 '24 09:07 zwhus

Sure, I will release the relevant evaluation datasets and evaluation code soon. I will inform you when it is complete. Thank you for your attention and patience.

JUNJIE99 avatar Jul 06 '24 17:07 JUNJIE99

Thanks, can you give me a rough timeline?

zwhus avatar Jul 07 '24 04:07 zwhus

Hello, the WebQA dataset and evaluation code have been made available here. Should you have any further questions, please feel free to reach out.

We will also be progressively uploading other evaluation datasets.

Thanks.

JUNJIE99 avatar Jul 08 '24 07:07 JUNJIE99

Thanks, I will try it

zwhus avatar Jul 09 '24 09:07 zwhus

Thanks. Could you also provide other benchmarks such as FashionIQ and CIRR? I notice the CIRR score is 23.9 (R@1) in Pixword, but in the paper it is 23.42 (R@5).

zwhus avatar Jul 15 '24 07:07 zwhus

I believe you're referring to Pic2Word. The R@1 results of Pic2Word are based on the test set, and the test corpus of CIRR only contains 2,316 images. However, our tests are conducted over the entire CIRR image corpus, which includes 21,551 images. Since this corpus is nearly ten times the size of the test corpus, differences in the metrics are inevitable.

The datasets for CIRR and FashionIQ (including label files and all images) have been added at this link. The format of all benchmark files is similar, so if you're in a hurry, you can make simple adjustments to the WebQA code, along the lines of the sketch below.
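
A minimal sketch of what such an index-and-search recall evaluation might look like. Everything here is illustrative: the function names, argument names, and data layout are assumptions for exposition, not the repo's actual API.

```python
import numpy as np
import faiss

def build_index(candidate_embs: np.ndarray) -> faiss.IndexFlatIP:
    # Inner-product index over L2-normalized embeddings, i.e. cosine similarity.
    index = faiss.IndexFlatIP(candidate_embs.shape[1])
    index.add(candidate_embs.astype(np.float32))
    return index

def recall_at_k(index, query_embs, gold_ids, candidate_ids, k=100):
    # Retrieve the top-k candidates for each query and check whether
    # the gold candidate appears among them.
    _, topk = index.search(query_embs.astype(np.float32), k)
    hits = [
        gold in {candidate_ids[j] for j in row}
        for gold, row in zip(gold_ids, topk)
    ]
    return float(np.mean(hits))
```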

JUNJIE99 avatar Jul 15 '24 07:07 JUNJIE99

Thank you for your quick response. I had noticed this detail. However, there are no CIRR images at the link. After I installed Pic2Word and downloaded the dataset, there are 16,939 training images and 2,315 each for testing and validation, for a total of 21,569 images. It seems there is a slight difference. Is any additional processing required?

zwhus avatar Jul 15 '24 08:07 zwhus

I apologize; it seems the CIRR image upload was interrupted earlier due to network issues. The re-upload is now complete, and you should be able to see the images at this link.

Regarding the slight difference, we did not perform any additional operations on the CIRR dataset. I checked the CIRR dataset paper, and they reported a total of 21,552 images in Table 2. This number is closer to the size of our corpus.

JUNJIE99 avatar Jul 15 '24 08:07 JUNJIE99

Thanks. When evaluating the other datasets, are the parameters consistent with those used for WebQA, such as k? I made some simple modifications for FashionIQ, but the metrics show a slight difference. If the parameters are consistent, I will carefully check my code.

zwhus avatar Jul 15 '24 09:07 zwhus

Yes, since we only compute metrics up to the top 100, setting k to 100 is sufficient.
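
As a small illustration of why k = 100 suffices: a single top-100 search already yields every smaller cutoff, so Recall@1/5/10/... can all be read off the same ranked lists. (Variable names here are hypothetical.)

```python
import numpy as np

def recalls(topk_ids: np.ndarray, gold: np.ndarray, cutoffs=(1, 5, 10, 50, 100)):
    """Compute Recall@k for several cutoffs from one (num_queries, 100)
    matrix of retrieved candidate ids and a per-query gold candidate id."""
    return {
        k: float(np.mean([g in row[:k] for g, row in zip(gold, topk_ids)]))
        for k in cutoffs
    }
```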

In addition, pay attention to the arguments of the model.encode_* calls within the index and search functions. For example, for FashionIQ, the corpus_type in the search function should be changed to mm_it, because its queries are image-text pairs.
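
For concreteness, here is a sketch of encoding a FashionIQ-style composed query (reference image plus modification text) against an image-only candidate, following the usage shown in the Visualized BGE README; the file paths and the caption text are placeholders.

```python
import torch
from FlagEmbedding.visual.modeling import Visualized_BGE

# Wrapper and weights as described in the repo README.
model = Visualized_BGE(
    model_name_bge="BAAI/bge-base-en-v1.5",
    model_weight="path/to/Visualized_base_en_v1.5.pth",
)
model.eval()

with torch.no_grad():
    # Composed ("mm_it") query: both an image and text are passed to encode.
    query_emb = model.encode(
        image="ref_garment.png", text="is darker and has long sleeves"
    )
    # Image-only corpus entry.
    candi_emb = model.encode(image="candidate_garment.png")

# Similarity score; the model returns normalized embeddings.
score = query_emb @ candi_emb.T
```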

JUNJIE99 avatar Jul 15 '24 09:07 JUNJIE99

Thank you, I was able to reproduce the results. I had mistakenly compared the m3 results against the base model's.

zwhus avatar Jul 15 '24 09:07 zwhus

Great!

JUNJIE99 avatar Jul 15 '24 09:07 JUNJIE99