
Question about GCC dataset download

Open yr666666 opened this issue 3 years ago • 1 comments

root
├── images_train
│   ├── 0000          # First four letters of the image name
│   │   ├── 0000000   # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...

Hello, please forgive my stupid question. I don't understand what you mean by "0000 # First four letters of the image name" and "0000000 # Image Binary" in your DATA.md. Can you explain what "Image Binary" and "First four letters of the image name" refer to? Thanks

yr666666, Dec 26 '21 08:12

Hi @yr666666

GCC (CC3M) provides the dataset as a list of image URLs with their associated captions. Because the original filenames are unordered and come in various formats, I renamed the files during download to a zero-padded sequence without the extension (.jpg, .png, ...). So these renamed "image files (binaries)" have names such as 0000000, 0000001, ..., 2983222, etc.

Putting all the files in a single directory slows down disk-related operations. So I partitioned them into subdirectories named by the "first four letters of the image name" (that is, the first four digits of the seven-digit name), so that every directory holds at most 1000 files.

dandelin, Dec 30 '21 15:12
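
For readers reproducing this layout, the naming scheme described in the reply above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the script used for ViLT: the function name `shard_path` and its parameters are hypothetical, and the seven-digit width and four-character prefix are inferred from the example names (0000000, 0001001, 2983222).

```python
from pathlib import Path


def shard_path(root, index, split="images_train", width=7, prefix_len=4):
    """Return the path for the image with the given sequential index.

    Hypothetical helper: `width` and `prefix_len` are assumptions taken from
    the example names in the thread, not from the ViLT download code.
    """
    # Zero-pad the running index, e.g. 1001 -> "0001001" (no file extension).
    name = f"{index:0{width}d}"
    # The first four characters select the subdirectory, so the remaining
    # three digits give at most 1000 files per directory.
    return Path(root) / split / name[:prefix_len] / name


if __name__ == "__main__":
    print(shard_path("root", 0).as_posix())        # root/images_train/0000/0000000
    print(shard_path("root", 1001).as_posix())     # root/images_train/0001/0001001
    print(shard_path("root", 2983222).as_posix())  # root/images_train/2983/2983222
```

A downloader following this scheme would create the parent directory (for example with `path.parent.mkdir(parents=True, exist_ok=True)`) and write the raw image bytes to the returned path.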