
Visualized BGE Evaluation

Open • zwhus opened this issue 1 year ago • 12 comments

Hi, I want to reproduce the results of Visualized BGE, but the zero-shot benchmarks, such as WebQA, are not clearly described. Could you provide the evaluation datasets and code for the zero-shot benchmarks? Thanks!

zwhus avatar Jul 06 '24 09:07 zwhus

Sure, I will release the relevant evaluation datasets and evaluation code soon. I will inform you when it is complete. Thank you for your attention and patience.

JUNJIE99 avatar Jul 06 '24 17:07 JUNJIE99

Thanks, can you give me a rough timeline?

zwhus avatar Jul 07 '24 04:07 zwhus

Hello, the WebQA dataset and evaluation code have been made available here. Should you have any further questions, please feel free to reach out.

We will also be progressively uploading other evaluation datasets.

Thanks.

JUNJIE99 avatar Jul 08 '24 07:07 JUNJIE99

Thanks, I will try it

zwhus avatar Jul 09 '24 09:07 zwhus

Thanks. Could you also provide other benchmarks such as FashionIQ and CIRR? I notice the CIRR score is 23.9 (R@1) in Pixword, but in the paper it is 23.42 (R@5).

zwhus avatar Jul 15 '24 07:07 zwhus

I believe you're referring to Pic2Word. The R@1 results of Pic2Word are based on the test set, and the test corpus of CIRR only contains 2,316 images. However, our tests are conducted over the entire CIRR image corpus, which includes 21,551 images. Since this corpus is nearly ten times the size of the test corpus, differences in the metrics are inevitable.

The datasets for CIRR and FashionIQ (including label files and all images) have been added at this link. The format of all benchmark files is similar, so if you're in a hurry, you can make simple adjustments to the WebQA code, along the lines of the sketch below.
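
A minimal sketch of what such an index-and-search recall evaluation might look like. Everything here is illustrative: the function names, argument names, and data layout are assumptions for exposition, not the repo's actual API.

```python
import numpy as np
import faiss

def build_index(candidate_embs: np.ndarray) -> faiss.IndexFlatIP:
    # Inner-product index over L2-normalized embeddings, i.e. cosine similarity.
    index = faiss.IndexFlatIP(candidate_embs.shape[1])
    index.add(candidate_embs.astype(np.float32))
    return index

def recall_at_k(index, query_embs, gold_ids, candidate_ids, k=100):
    # Retrieve the top-k candidates for each query and check whether
    # the gold candidate appears among them.
    _, topk = index.search(query_embs.astype(np.float32), k)
    hits = [
        gold in {candidate_ids[j] for j in row}
        for gold, row in zip(gold_ids, topk)
    ]
    return float(np.mean(hits))
```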

JUNJIE99 avatar Jul 15 '24 07:07 JUNJIE99

Thank you for your quick response. I had noticed this detail. However, there are no CIRR images at the link. After I installed Pic2Word and downloaded the dataset, there are 16,939 training images and 2,315 each for testing and validation, for a total of 21,569 images. It seems there is a slight difference. Is any additional processing required?

zwhus avatar Jul 15 '24 08:07 zwhus

I apologize; it seems the CIRR image upload was interrupted earlier due to network issues. The re-upload is now complete, and you should be able to see the images at this link.

Regarding the slight difference, we did not perform any additional operations on the CIRR dataset. I checked the CIRR dataset paper, and they reported a total of 21,552 images in Table 2. This number is closer to the size of our corpus.

JUNJIE99 avatar Jul 15 '24 08:07 JUNJIE99

Thanks. When evaluating the other datasets, are the parameters consistent with those used for WebQA, such as k? I made some simple modifications for FashionIQ, but the metrics show a slight difference. If the parameters are consistent, I will carefully check my code.

zwhus avatar Jul 15 '24 09:07 zwhus

Yes, since we only compute metrics up to the top 100, setting k to 100 is sufficient.
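
As a small illustration of why k = 100 suffices: a single top-100 search already yields every smaller cutoff, so Recall@1/5/10/... can all be read off the same ranked lists. (Variable names here are hypothetical.)

```python
import numpy as np

def recalls(topk_ids: np.ndarray, gold: np.ndarray, cutoffs=(1, 5, 10, 50, 100)):
    """Compute Recall@k for several cutoffs from one (num_queries, 100)
    matrix of retrieved candidate ids and a per-query gold candidate id."""
    return {
        k: float(np.mean([g in row[:k] for g, row in zip(gold, topk_ids)]))
        for k in cutoffs
    }
```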

In addition, pay attention to the arguments of the model.encode_* calls within the index and search functions. For example, for FashionIQ, the corpus_type in the search function should be changed to mm_it, because its queries are image-text pairs.
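
For concreteness, here is a sketch of encoding a FashionIQ-style composed query (reference image plus modification text) against an image-only candidate, following the usage shown in the Visualized BGE README; the file paths and the caption text are placeholders.

```python
import torch
from FlagEmbedding.visual.modeling import Visualized_BGE

# Wrapper and weights as described in the repo README.
model = Visualized_BGE(
    model_name_bge="BAAI/bge-base-en-v1.5",
    model_weight="path/to/Visualized_base_en_v1.5.pth",
)
model.eval()

with torch.no_grad():
    # Composed ("mm_it") query: both an image and text are passed to encode.
    query_emb = model.encode(
        image="ref_garment.png", text="is darker and has long sleeves"
    )
    # Image-only corpus entry.
    candi_emb = model.encode(image="candidate_garment.png")

# Similarity score; the model returns normalized embeddings.
score = query_emb @ candi_emb.T
```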

JUNJIE99 avatar Jul 15 '24 09:07 JUNJIE99

Thank you, I was able to reproduce the results. I had mistakenly compared the m3 results against the base model's.

zwhus avatar Jul 15 '24 09:07 zwhus

Great!

JUNJIE99 avatar Jul 15 '24 09:07 JUNJIE99