GroupViT
Mistakes in the data preparation command
https://github.com/NVlabs/GroupViT#gcc3m
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
--url_col "url" --caption_col "caption" --output_format webdataset \
--output_folder local_data/gcc3m_shards \
--processes_count 16 --thread_count 64 \
--image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
--enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*
1. After checking the documentation of img2dataset, I find that there is no argument --save_metadata for img2dataset, and this may cause an error.
2. Another mistake is #7.
3. If I have not misunderstood, the rename command may not have the -d option. The tested OS is Ubuntu 18.04 LTS.
4. There are differences between gcc12m.tsv and gcc3m.tsv after downloading. For gcc12m.tsv, the format is [url]\t[caption], while the format for gcc3m.tsv is [caption]\t[url]. Hence the sed command, which writes a fixed header, contains a mistake.
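Since the two files use opposite column orders, one way to avoid hard-coding the header is to infer it from the first data row. The following is only a minimal sketch, not part of GroupViT: the helper names infer_header and prepend_header are hypothetical, and it assumes every URL starts with http:// or https:// and every row has exactly two tab-separated fields.

```python
import os
import tempfile


def infer_header(first_line: str) -> str:
    """Guess the column order of a two-column TSV row by locating the URL field."""
    fields = first_line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields, got {len(fields)}")
    # Assumption: a URL field starts with http:// or https://, a caption does not.
    is_url = [field.startswith(("http://", "https://")) for field in fields]
    if is_url == [True, False]:
        return "url\tcaption"
    if is_url == [False, True]:
        return "caption\turl"
    raise ValueError("could not tell which field is the URL")


def prepend_header(path: str) -> str:
    """Insert the inferred header as the first line of the file and return it."""
    with open(path, encoding="utf-8") as src:
        first_line = src.readline()
        header = infer_header(first_line)
        # Stream into a temporary file so a large TSV is never held in memory.
        tmp_dir = os.path.dirname(os.path.abspath(path))
        with tempfile.NamedTemporaryFile(
            "w", encoding="utf-8", dir=tmp_dir, delete=False
        ) as dst:
            dst.write(header + "\n")
            dst.write(first_line)
            for line in src:
                dst.write(line)
    os.replace(dst.name, path)
    return header
```

Calling prepend_header("gcc3m.tsv") would then play the role of the sed line for either file. A file that already starts with a header row is rejected with a ValueError, because neither field of that row looks like a URL, so running the sketch twice does not add a second header.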
Hi @slyviacassell
- We process the dataset with img2dataset==1.12.0: https://github.com/rom1504/img2dataset/tree/1.12.0
- rename has the -d option in Ubuntu 20.04.
Points 2 & 4 are mistakes and we will fix them. Thank you for pointing them out!
FYI, thanks.
FYI, thanks. I would like to ask if gcc3M supports breakpoint (resumable) download? Also, does gcc3M have 436 .tar files? Is it possible to pre-train on gcc3M alone? Thanks.
Hi @pzhren You may comment out "gcc12m" and "yfcc14m" to train on "gcc3m" only
Would like to ask if gcc3M supports breakpoint download?
As far as I know, the img2dataset doesn't support incremental download at this moment.
Is it normal for the gcc3M I downloaded to have only 332 .tar files?
@pzhren Maybe some of the links are invalid. But it should be fine.
rename has option -d in Ubuntu 20.04
@xvjiarui Sorry for the late reply; I am now getting back to this project. I have checked the manual for rename in Ubuntu 20.04 and there is no -d option for rename. Just to be sure, could you tell me how you installed rename, so that we can confirm we are talking about the same command? In my case, rename is installed with apt install rename. Moreover, could you also explain the purpose of the -d option?
Hi @slyviacassell
I installed rename by running sudo apt-get install rename
(base) ➜ ~ rename -h
Usage:
rename [ -h|-m|-V ] [ -v ] [ -0 ] [ -n ] [ -f ] [ -d ]
[ -e|-E perlexpr]*|perlexpr [ files ]
Options:
-v, --verbose
Verbose: print names of files successfully renamed.
-0, --null
Use \0 as record separator when reading from STDIN.
-n, --nono
No action: print names of files to be renamed, but don't rename.
-f, --force
Over write: allow existing files to be over-written.
--path, --fullpath
Rename full path: including any directory component. DEFAULT
-d, --filename, --nopath, --nofullpath
Do not rename directory: only rename filename component of path.
-h, --help
Help: print SYNOPSIS and OPTIONS.
-m, --man
Manual: print manual page.
-V, --version
Version: show version number.
-e Expression: code to act on files name.
May be repeated to build up code (like "perl -e"). If no -e, the
first argument is used as code.
-E Statement: code to act on files name, as -e but terminated by
';'.
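According to the help text above, the default mode applies the expression to the full path, so without -d the pattern s/^/gcc-train-/ would put the prefix in front of local_data/ rather than in front of each shard name. For systems where rename lacks -d, the same intent can be sketched in Python. This is an illustrative stand-in, not the command used by the authors. The function name add_prefix is hypothetical, and skipping files that already carry the prefix is an added safety choice that rename itself does not make.

```python
from pathlib import Path


def add_prefix(folder: str, prefix: str = "gcc-train-", dry_run: bool = False) -> list:
    """Prepend `prefix` to each file name in `folder`, leaving the directory part untouched."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.name.startswith(prefix):
            continue  # skip sub-directories and files that are already prefixed
        target = path.with_name(prefix + path.name)
        if not dry_run:
            path.rename(target)
        renamed.append(target.name)
    return renamed
```

Calling add_prefix("local_data/gcc3m_shards", dry_run=True) lists the new names without touching any file, similar in spirit to the -n option shown in the help text.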
Running
python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv
reports an error:
INFO: Pandarallel will run on 20 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Traceback (most recent call last):
File "convert_dataset/create_subset.py", line 114, in <module>
main()
File "convert_dataset/create_subset.py", line 64, in main
download_db(files)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/yfcc100m/convert_metadata.py", line 63, in download_db
total = obj.content_length
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/boto3/resources/factory.py", line 380, in property_loader
self.load()
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/boto3/resources/factory.py", line 564, in do_action
response = action(self, *args, **kwargs)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/boto3/resources/action.py", line 88, in __call__
response = getattr(parent.meta.client, operation_name)(*args, **params)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/client.py", line 732, in _make_api_call
operation_model, request_dict, request_context)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/client.py", line 751, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/endpoint.py", line 107, in make_request
return self._send_request(request_dict, operation_model)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/endpoint.py", line 180, in _send_request
request = self.create_request(request_dict, operation_model)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/endpoint.py", line 121, in create_request
operation_name=operation_model.name)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/hooks.py", line 358, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/hooks.py", line 229, in emit
return self._emit(event_name, kwargs)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/hooks.py", line 212, in _emit
response = handler(**kwargs)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/signers.py", line 95, in handler
return self.sign(operation_name, request)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/signers.py", line 167, in sign
auth.add_auth(request)
File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/auth.py", line 401, in add_auth
raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
How do you deal with that?
Hi @pzhren You need to follow the instruction here and set up your own AWS credential.
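Before rerunning the script, it can help to check whether credentials are visible at all. The sketch below uses only the standard library and looks at two common sources: the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables and the shared credentials file that aws configure writes (~/.aws/credentials by default). The function name find_aws_credentials is hypothetical. boto3 consults further sources as well, so a None result here is a hint rather than proof that boto3 will fail, and a non-None result does not prove the keys are valid.

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None):
    """Return "env" or "file" depending on where keys were found, else None."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "env"
    # Fall back to the shared credentials file (overridable via environment variable).
    path = Path(
        credentials_file
        or env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file simply yields no sections
    profile = env.get("AWS_PROFILE", "default")
    has_keys = (
        parser.has_section(profile)
        and parser.has_option(profile, "aws_access_key_id")
        and parser.has_option(profile, "aws_secret_access_key")
    )
    return "file" if has_keys else None
```

Running find_aws_credentials() in the same shell and conda environment as create_subset.py shows whether the keys are visible from that environment at all.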
Hi @pzhren You need to follow the instruction here and set up your own AWS credential.
I ran aws configure, then ran 'python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv', which generated a new error:
It seems that your credentials are not set up correctly
It seems that your credentials are not set up correctly
Is this a problem?
