CLIP
What exactly are the sources of WebImageText dataset?
The paper gives only a vague description of the WIT dataset:

> ... we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.
Can you enumerate the specific sources and the methods used to acquire the images?
How are the texts engineered? The paper discusses prompt engineering for the evaluation task datasets, but not for the 400M training dataset.
Any pointers to more information would be appreciated.