Download and process GCC and SBU

Open wanng-ide opened this issue 3 years ago • 10 comments

I'm very sorry for my stupid question.

The datasets on the official websites come as '.tsv' or other formats, but '.json' annotation files are required before the arrow files can be built.

If it is convenient for you, could you share your code for downloading the images and converting the tsv into json? Sorry to bother you.
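For readers with the same question, here is a minimal sketch of the tsv-to-json step. It is not the author's script. It assumes each tsv row is `caption<TAB>url` with no header, and that the json should be a list of `[image_path, caption]` pairs; both the pair format and the names `read_caption_tsv` and `write_annotations` are assumptions for illustration, so check them against the arrow-writing script you actually use.

```python
import csv
import json


def read_caption_tsv(tsv_path):
    """Read caption<TAB>url rows and give each row a 7-digit image name."""
    rows = []
    with open(tsv_path, newline="", encoding="utf-8") as fp:
        reader = csv.reader(fp, delimiter="\t", quoting=csv.QUOTE_NONE)
        for index, row in enumerate(reader):
            if len(row) < 2:
                continue  # skip malformed rows; their index is left unused
            rows.append({"name": f"{index:07d}", "caption": row[0], "url": row[1]})
    return rows


def write_annotations(rows, json_path, image_dir_name="images_train"):
    """Write [relative_image_path, caption] pairs (assumed format) to json."""
    pairs = [
        # Subdirectory = first four characters of the image name.
        [f"{image_dir_name}/{row['name'][:4]}/{row['name']}", row["caption"]]
        for row in rows
    ]
    with open(json_path, "w", encoding="utf-8") as fp:
        json.dump(pairs, fp, ensure_ascii=False)
    return pairs
```

The `url` field is kept in each row so that a separate downloader can save the image for row `name` under the matching path.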

wanng-ide avatar Nov 16 '21 03:11 wanng-ide

You can download the json file here: https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip

trucvip123 avatar Nov 18 '21 11:11 trucvip123

Hi @wanng-ide and thanks @trucvip123. @trucvip123 is right :)

dandelin avatar Nov 22 '21 14:11 dandelin

Have you finished downloading the GCC and SBU datasets? Could you share them? @wanng-ide

wtkszzz avatar Dec 12 '21 04:12 wtkszzz

**wtkszzz** commented 1 hour ago

Not at all... I still do not know how to process the images into a format like 0000001.jpg, and I do not know how to check whether my downloaded images are complete.
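On the completeness check, one rough heuristic (a sketch, not something from the repo) is to test whether each file starts with the JPEG start-of-image marker and ends with the end-of-image marker. This assumes the files are JPEGs with no trailing bytes; other formats, or JPEGs with extra data after the end marker, will be flagged, and fully decoding each file with an imaging library is a stricter test. The function names are hypothetical.

```python
import os
from pathlib import Path

JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic: file begins with SOI and ends with EOI."""
    path = Path(path)
    if path.stat().st_size < 4:
        return False
    with open(path, "rb") as fp:
        head = fp.read(2)
        fp.seek(-2, os.SEEK_END)  # jump to the last two bytes
        tail = fp.read(2)
    return head == JPEG_START and tail == JPEG_END


def find_incomplete_images(root):
    """Return files under root that fail the heuristic, in sorted order."""
    return [
        path
        for path in sorted(Path(root).rglob("*"))
        if path.is_file() and not looks_like_complete_jpeg(path)
    ]
```

Files returned by `find_incomplete_images` are candidates for re-download, not proven failures.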

@dandelin

wanng-ide avatar Dec 12 '21 05:12 wanng-ide

@wanng-ide

Ouch, I misread the question. As I said in DATA.md, you should write your own script. I can't share my exact process since I used my company's infrastructure. That said, this repo seems like a good starting point.

dandelin avatar Dec 12 '21 05:12 dandelin

Hello, in the author's description of CC and SBU:

root
├── images_train
│   ├── 0000  # First four letters of image name
│   │   ├── 0000000  # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...

Does "0000 # First four letters of image name" refer to the image file name, i.e. something like 0000001.jpg? And what does the "Image Binary" after it mean? Thanks.

@wanng-ide

yr666666 avatar Dec 26 '21 07:12 yr666666

root
├── images_train
│   ├── 0000  # First four letters of the image name
│   │   ├── 0000000  # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...

Hello, please forgive my stupid question. I don't understand what "0000 # First four letters of image name" and "0000000 # Image Binary" mean in your DATA.md. Could you explain what "Image Binary" and "First four letters of image name" refer to? Thanks.

yr666666 avatar Dec 26 '21 08:12 yr666666

@wtkszzz

I haven't finished downloading yet. Many of the links are broken, and the data can't be downloaded from within China.

wanng-ide avatar Dec 28 '21 06:12 wanng-ide

Hello @yr666666, this should be something an automated tool can handle. ref: https://github.com/rom1504/img2dataset I haven't finished processing it yet.
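One possible reading of the DATA.md layout quoted earlier in this thread: each image is stored under a zero-padded 7-digit name with no extension (the "Image Binary" being the raw image file itself), inside a subdirectory named after the first four characters of that name. The sketch below follows that reading; it is an interpretation, not confirmed by the author, and `layout_path` and `place_image` are hypothetical helper names.

```python
import shutil
from pathlib import Path


def layout_path(root, index, split="train"):
    """Map an integer index to root/images_<split>/<first 4 chars>/<7-digit name>."""
    name = f"{index:07d}"
    return Path(root) / f"images_{split}" / name[:4] / name


def place_image(src, root, index, split="train"):
    """Copy a downloaded image file into the assumed layout and return its new path."""
    dst = layout_path(root, index, split)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)  # bytes are copied unchanged; only the name changes
    return dst
```

For example, index 1000 maps to `images_train/0001/0001000`, which matches the second directory shown in the tree.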

wanng-ide avatar Dec 28 '21 06:12 wanng-ide

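Could you leave some contact information, such as an email address? I have also been studying this work recently. Would you be open to discussing it?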

yr666666 avatar Dec 28 '21 13:12 yr666666