ViLT
Question about GCC dataset download
root
├── images_train
│   ├── 0000  # First four letters of the image name
│   │   ├── 0000000  # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...
Hello, please forgive my stupid question. I don't understand what you mean by "0000 # First four letters of the image name" and "0000000 # Image Binary" in your DATA.md. Can you explain what the "Image Binary" and the "First four letters of the image name" are? Thanks
Hi @yr666666
GCC (CC3M) provides the dataset in the form of image URLs and their related captions.
Since the original filenames are unordered and come in various formats, I renamed them to an ordered sequence without the extension (like .jpg, .png, ...) during the download.
So these renamed "image files (binaries)" have names such as 0000000, 0000001, ..., 2983222, etc.
If I put all the files in a single directory, disk-related operations slow down. So I partitioned them into directories named after the "first four letters of the image name", so that every directory holds at most 1000 files. For example, 0001000 and 0001001 both go into the directory 0001.
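To make the naming scheme concrete, here is a minimal Python sketch of how a download index could be mapped to its path. It is not the actual download script: the function name `shard_path` and its parameters are hypothetical, and it assumes 7-digit zero-padded names, as in the examples above.

```python
from pathlib import Path


def shard_path(root: Path, index: int, width: int = 7, prefix_len: int = 4) -> Path:
    """Map a download index to <root>/<first four characters>/<zero-padded name>.

    Hypothetical helper: width=7 and prefix_len=4 are assumptions taken from
    the example names in this thread (0000000 ... 2983222).
    """
    # Zero-pad the index so every file name has the same length, no extension.
    name = f"{index:0{width}d}"
    # The directory is the first four characters of the file name, so
    # indices 1000..1999 all land in "0001" (at most 1000 files each).
    return root / name[:prefix_len] / name


if __name__ == "__main__":
    root = Path("root/images_train")
    for i in (0, 1, 1000, 1001, 2983222):
        print(shard_path(root, i))
```

With this layout, the image downloaded 1001st (index 1000) would sit at root/images_train/0001/0001000, which matches the tree in DATA.md.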