ViLT
Download and process GCC and SBU
I'm very sorry for my stupid question.
The datasets on the websites come as '.tsv' or other formats, but some '.json' files are required before the arrow files can be built.
If it is convenient for you, could you share your code for downloading the images and converting the tsv into json? I am very sorry to disturb you.
You can download the json file here: https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip
Hi @wanng-ide and thanks @trucvip123. @trucvip123 is right :)
Have you finished downloading the GCC and SBU datasets? Could you share them? @wanng-ide
**wtkszzz** commented 1 hour ago
Not at all... I still do not know how to process the images into a format like 0000001.jpg, and I do not know how to check whether my downloaded images are complete.
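On the "how do I check my downloads" part, here is a minimal sketch of one cheap check, not anything from the ViLT repo. It only looks at the leading bytes of each file (and, for JPEG, the end-of-image marker), so it catches HTML error pages and truncated JPEGs but not every kind of corruption. The function names `sniff_format`, `looks_complete` and `find_bad_files` are made up for this example.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_format(data):
    """Return the detected format name, or None if no known signature matches."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def looks_complete(data):
    """Heuristic: known signature, and for JPEG the end-of-image marker is present."""
    fmt = sniff_format(data)
    if fmt is None:
        return False  # e.g. an HTML error page saved in place of the image
    if fmt == "jpeg":
        # A fully written JPEG normally ends with FF D9; trailing zero padding is tolerated.
        return data.rstrip(b"\x00").endswith(b"\xff\xd9")
    return True


def find_bad_files(root):
    """Return the sorted paths under root whose contents fail the heuristic."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and not looks_complete(path.read_bytes()):
            bad.append(path)
    return bad
```

Calling `find_bad_files("root/images_train")` would list the files worth downloading again. A stricter check would decode each file with an image library, which is slower but more reliable.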
@dandelin
@wanng-ide
Ouch, I misread the question.
As I said in DATA.md, you should write your own script.
I can't share my exact process because I used my company's infrastructure.
Though, this repo seems good to me as a starting point.
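For anyone writing that script, a rough sketch of what it could look like follows. It is not the author's process, and it rests on several assumptions: the input is a headerless file with `caption<TAB>url` per line (which is how I remember the GCC tsv, please verify), each image is named by its zero-padded row index, and the annotations are written to a hypothetical `<split>_annot.json` as `[relative_path, caption]` pairs. Check the file name and pair format against the repo's arrow-writing script before relying on them. `build_split` and `fetch_url` are names invented here.

```python
import csv
import json
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=10):
    """Download one URL and return its bytes (plain urllib, no retries)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def build_split(tsv_path, root, split="train", fetch=fetch_url):
    """Download each caption<TAB>url row into root/images_<split>/<first 4 chars>/<name>.

    Returns the [relative_path, caption] pairs, which are also written to
    root/<split>_annot.json (assumed annotation format, see the note above).
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    annotations = []
    with open(tsv_path, newline="", encoding="utf-8") as fp:
        reader = csv.reader(fp, delimiter="\t", quoting=csv.QUOTE_NONE)
        for index, row in enumerate(reader):
            if len(row) < 2:
                continue  # malformed line
            caption, url = row[0], row[1]
            name = f"{index:07d}"  # assumed naming: row 1 becomes 0000001
            try:
                data = fetch(url)
            except Exception:
                continue  # dead or blocked link: skip it, the index stays reserved
            target = root / f"images_{split}" / name[:4] / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)  # raw bytes, no extension, as in the DATA.md tree
            annotations.append([f"images_{split}/{name[:4]}/{name}", caption])
    with open(root / f"{split}_annot.json", "w", encoding="utf-8") as fp:
        json.dump(annotations, fp, ensure_ascii=False)
    return annotations
```

The `fetch` argument is injectable so the function can be tried on a toy file without network access. A sequential loop like this is far too slow for millions of URLs, so a real run would need parallel downloads and retries.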
@wanng-ide
Hello, please forgive my basic question. When describing CC and SBU, DATA.md shows this layout:

root
├── images_train
│   ├── 0000  # First four letters of image name
│   │   ├── 0000000  # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...

Does "0000 # First four letters of image name" refer to the image file name, i.e. something like 0000001.jpg? And what does the "Image Binary" after it mean? Thanks.
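My reading of that tree, which the author has not confirmed in this thread: each leaf such as `0000000` is a file holding the raw image bytes (the "Image Binary"), saved under its name without an extension, and the folder it sits in is simply the first four characters of that name. With 7-character names that puts at most 1000 files in each folder. A small sketch of the mapping, with function names invented for illustration:

```python
from pathlib import Path


def image_location(root, name, split="train"):
    """Map an image name such as '0001001' to root/images_<split>/0001/0001001."""
    shard = name[:4]  # "First four letters of image name"
    return Path(root) / f"images_{split}" / shard / name


def read_image_binary(root, name, split="train"):
    """Return the raw bytes stored in the file, i.e. the "Image Binary"."""
    return image_location(root, name, split).read_bytes()
```

Under this reading, `image_location("root", "0001001")` points at `root/images_train/0001/0001001`, matching the second folder in the tree.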
wtkszzz
I have not finished downloading yet. Many of the links are broken, and the data cannot be downloaded from within China.
@wanng-ide
Hello @yr666666, this looks like an automated tool for the job. ref: https://github.com/rom1504/img2dataset I have not finished processing the data yet.
Could you leave some contact information, such as an email address? I have also been studying this work recently. Could we discuss it?