
How is the dataset collected?


This is a question about the paper rather than this codebase. Section 2.2 of the paper briefly describes how the data were gathered: "...we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet."

I was wondering what the publicly available sources are (e.g. Google image search, Flickr image search, etc.).

linzhiqiu avatar Jan 25 '21 21:01 linzhiqiu

Tangential to @linzhiqiu's question, I'm also curious how much of the data comes from existing academic datasets, and which ones, if any, were used to train the model.

Is OpenAI open to sharing this information? I think it's crucial for examining the model for bias. For example, if Places365 was used to train the model, it would be redundant to test how it performs on that dataset.

rsomani95 avatar Jan 26 '21 14:01 rsomani95

This is mentioned in the paper: "The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume." I'm curious how the high-frequency words are augmented with the bi-grams and Wikipedia article names.
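
In case it helps anyone trying to reproduce that query list, here is a minimal sketch of the bi-gram part. The toy corpus, the tokenization, and both thresholds are my own assumptions; the paper specifies none of them.

```python
import math
from collections import Counter

# Toy stand-in for tokenized English-Wikipedia sentences (assumption).
sentences = [
    ["the", "golden", "gate", "bridge", "spans", "the", "bay"],
    ["the", "golden", "gate", "bridge", "is", "in", "san", "francisco"],
]

def build_query_list(sentences, min_count=100, pmi_threshold=5.0):
    """Base queries = words occurring >= min_count times, augmented with
    bi-grams whose pointwise mutual information exceeds a threshold.
    Both default thresholds are guesses, not the paper's values."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    queries = {w for w, c in unigrams.items() if c >= min_count}
    for (w1, w2), c in bigrams.items():
        # PMI(w1, w2) = log[ p(w1, w2) / (p(w1) * p(w2)) ]
        pmi = math.log((c / n_bi) / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        if pmi >= pmi_threshold:
            queries.add(f"{w1} {w2}")
    return queries

print(build_query_list(sentences, min_count=2, pmi_threshold=1.0))
```

The Wikipedia article names would then be appended to the same list, but without the search-volume data that part can't be reproduced.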

meigaoms avatar Feb 05 '21 01:02 meigaoms

I would also like to echo the request for more information about the dataset.

As you show in Section 5 of the paper, overlap with the training dataset can be relevant when evaluating the performance and suitability of the pretrained model in novel contexts.

Besides that, the source, reach, and quality of the dataset are important factors in our internal ethics evaluation process, and we are currently unable to assess them.

If you are unable to release full details of the dataset, would you be willing to release more partial details? Some suggestions:

  • a list of links to the images
  • a list of file hashes (which would enable overlap checks like the sketch at the end of this comment)
  • a list of the 500,000 queries used to gather the images
  • a list of the publicly available sources used (as suggested by the original commenter)

Any additional details would be greatly appreciated and would aid our research into the potential impact of models like CLIP in our domain.
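
As a concrete example of what the hash list alone would enable, here is a sketch of an overlap check against an evaluation set. The file clip_training_hashes.txt and the Places365 path are hypothetical, and exact byte hashes only catch exact duplicates (a perceptual hash would also catch re-encoded copies):

```python
import hashlib
from pathlib import Path

def file_md5(path):
    """MD5 digest of a file's raw bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

# Hypothetical released hash list: one hex digest per line.
released_hashes = set(Path("clip_training_hashes.txt").read_text().split())

# Flag evaluation images that also appear in the training set.
overlap = [p for p in Path("places365/val").glob("**/*.jpg")
           if file_md5(p) in released_hashes]
print(f"{len(overlap)} evaluation images overlap with the training set")
```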

Rijgersberg avatar Apr 21 '21 15:04 Rijgersberg

This is mentioned in the paper: "The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume." I'm curious how the high-frequency words are augmented with the bi-grams and Wikipedia article names.

I want to know how they expand the query words into full texts for training. I mean, if the description of the image was just formatted from a template (like "a photo of {}"), wouldn't the performance be equal to using a single keyword without a template?
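
The paper does touch on this in Section 3.1.4: using the prompt "A photo of a {label}." instead of the bare label text improved ImageNet accuracy by 1.3%, so the two are not equivalent at inference time. You can compare them yourself with this repo's API; the image path and the label list below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "cat", "bird"]  # placeholder class names
bare = clip.tokenize(labels).to(device)
templated = clip.tokenize([f"a photo of a {l}" for l in labels]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    for name, text in [("bare keywords", bare), ("templated", templated)]:
        logits_per_image, _ = model(image, text)
        print(name, logits_per_image.softmax(dim=-1).cpu().numpy())
```

How the training captions themselves were produced is a separate question the paper doesn't answer; the queries were used to find (image, text) pairs, and the paired text presumably came from the web pages rather than from a template.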

realTaki avatar Sep 13 '21 05:09 realTaki

In case you don't already know: this paper uses the released model to work backwards and retrieve the closest 400M image-text pairs from a web-crawl dataset.
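
For context, that kind of reverse construction amounts to scoring crawled (image, alt-text) pairs with the released model and keeping only the most similar ones. A minimal sketch with this repo's API; the candidate list and the 0.3 cutoff are assumptions, not the referenced paper's exact settings:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical crawled candidates: (image_path, alt_text) pairs.
candidates = [("img_000.jpg", "a brown dog on a beach"),
              ("img_001.jpg", "buy cheap widgets now")]

kept = []
with torch.no_grad():
    for path, caption in candidates:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        text = clip.tokenize([caption], truncate=True).to(device)
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        sim = (img_f @ txt_f.T).item()
        if sim > 0.3:  # cutoff is an assumption
            kept.append((path, caption, sim))
```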

JunweiLiang avatar Dec 10 '21 08:12 JunweiLiang

This is mentioned in the paper: "The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume." I'm curious how the high-frequency words are augmented with the bi-grams and Wikipedia article names.

I want to know how they expand the query words into full texts for training. I mean, if the description of the image was just formatted from a template (like "a photo of {}"), wouldn't the performance be equal to using a single keyword without a template?

Maybe OpenAI used some tricks so that the template format is more than "a photo of {}", e.g. for animals the template might be "the animal name is {}". As a result, maybe the template guided the model?
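
For what it's worth, the repo's notebooks/Prompt_Engineering_for_ImageNet.ipynb does use many templates, and it ensembles them by averaging the normalized text embeddings per class rather than picking one. A rough sketch of that ensembling, using just two of the notebook's templates:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a photo of the small {}."]  # subset of the notebook's 80
classnames = ["dog", "cat"]  # placeholder classes

with torch.no_grad():
    weights = []
    for name in classnames:
        texts = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(texts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())
    # (embedding_dim, n_classes) zero-shot classifier, as in the notebook
    zeroshot_weights = torch.stack(weights, dim=1)
```

Note this is about zero-shot inference prompts, though; whether any templating was applied to the training captions is exactly what isn't documented.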

igo312 avatar Dec 10 '21 10:12 igo312