Are there public datasets in your WIT dataset?

Open soniamartinot opened this issue 1 year ago • 1 comment

Does the WIT dataset contain images or text from public datasets such as COCO, DIOR, BraTS, or DOTA?

Without this information, work that builds on CLIP is undermined by the suspicion of data leakage, because the model was trained on an unspecified dataset. This hinders research based on CLIP.

Dec 19 '24 18:12 soniamartinot

The abstract of the CLIP paper says:

"a dataset of 400 million (image, text) pairs collected from the internet"

The COCO paper says:

"we collected images from Flickr"

and Section 3.2 of that paper reads as if they also used Google and Bing image search.

So yes, there might be data contamination. Whether that actually matters depends on the problem being solved.
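
For any two image collections you do have access to, a first check for contamination could look roughly like the sketch below. It only finds byte-identical files, and the function names and toy data are made up for illustration; this is not how the CLIP authors checked for overlap.

```python
import hashlib


def content_hashes(blobs):
    """Return the set of SHA-256 digests for an iterable of bytes objects."""
    return {hashlib.sha256(b).hexdigest() for b in blobs}


def exact_overlap(train_blobs, eval_blobs):
    """Return the indices of eval items whose bytes also occur in the training collection."""
    train = content_hashes(train_blobs)
    return [
        i for i, b in enumerate(eval_blobs)
        if hashlib.sha256(b).hexdigest() in train
    ]


if __name__ == "__main__":
    # Toy stand-ins for raw image bytes (hypothetical data).
    train = [b"img-a", b"img-b", b"img-c"]
    evals = [b"img-x", b"img-b"]
    print(exact_overlap(train, evals))
```

Exact hashing misses copies that were resized or re-encoded, so an empty result does not rule out contamination; catching those would need a near-duplicate method such as perceptual hashing or embedding similarity.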

Jan 23 '25 12:01 99991