GroupViT Mistakes in the data preparation command

https://github.com/NVlabs/GroupViT#gcc3m

sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset\
            --output_folder local_data/gcc3m_shards
            --processes_count 16 --thread_count 64
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*

After checking the img2dataset documentation, I found no --save_metadata argument, so this command may fail with an error.

Apr 16 '22 09:04 slyviacassell

Another mistake is #7.

Apr 16 '22 09:04 slyviacassell

Unless I am mistaken, the rename command may not have a -d option. I tested on Ubuntu 18.04 LTS.

Apr 16 '22 09:04 slyviacassell

The column order differs between gcc12m.tsv and gcc3m.tsv after downloading. For gcc12m.tsv the format is [url]\t[caption], while for gcc3m.tsv it is [caption]\t[url]. Hence the sed command is mistaken.
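
To avoid hard-coding the column order, the header could be chosen by inspecting the first row. The following is a minimal sketch, not the repository's method: `looks_like_url`, `header_for`, and `prepend_header` are hypothetical helper names, and it assumes each file has exactly two tab-separated columns, one of which starts with http:// or https://.

```python
import os
import shutil
from pathlib import Path


def looks_like_url(field):
    """Heuristic (an assumption): a URL column starts with http:// or https://."""
    return field.startswith(("http://", "https://"))


def header_for(first_row):
    """Return the header names that match the column order of one data row."""
    if len(first_row) != 2:
        raise ValueError("expected exactly two tab-separated columns")
    first, second = first_row
    if looks_like_url(first) and not looks_like_url(second):
        return ["url", "caption"]
    if looks_like_url(second) and not looks_like_url(first):
        return ["caption", "url"]
    raise ValueError("cannot tell which column holds the URL")


def prepend_header(tsv_path):
    """Prepend a matching header line, streaming so large files are not loaded at once."""
    path = Path(tsv_path)
    tmp = path.with_name(path.name + ".tmp")
    with path.open("r", encoding="utf-8", newline="") as src:
        first_line = src.readline()
        # Raises before any file is written if the order cannot be determined.
        header = header_for(first_line.rstrip("\r\n").split("\t"))
        with tmp.open("w", encoding="utf-8", newline="") as dst:
            dst.write("\t".join(header) + "\n")
            dst.write(first_line)
            shutil.copyfileobj(src, dst)
    os.replace(tmp, path)
    return header
```

Calling `prepend_header("gcc3m.tsv")` and `prepend_header("gcc12m.tsv")` would then write a header in whichever order each file uses. If a header is already present, the first row contains no URL and the function raises instead of adding a second header.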

Apr 16 '22 13:04 slyviacassell

Hi @slyviacassell

We processed the dataset with img2dataset==1.12.0: https://github.com/rom1504/img2dataset/tree/1.12.0
rename has the -d option on Ubuntu 20.04.

Points 2 and 4 are mistakes and we will fix them. Thank you for pointing them out!

Apr 16 '22 17:04 xvjiarui

FYI, thanks.

Apr 18 '22 07:04 slyviacassell

I would like to ask whether gcc3M supports breakpoint (resumable) download. Also, should gcc3M have 436 .tar files? Is it possible to pre-train on gcc3M alone? Thanks.

Apr 18 '22 10:04 pzhren

Hi @pzhren, you may comment out "gcc12m" and "yfcc14m" to train on "gcc3m" only.

Apr 19 '22 16:04 xvjiarui

Would like to ask if gcc3M supports breakpoint download?

As far as I know, img2dataset doesn't support incremental download at the moment.

Apr 20 '22 02:04 slyviacassell

Would like to ask if gcc3M supports breakpoint download?

As far as I know, img2dataset doesn't support incremental download at the moment.

Is it normal for the gcc3M I downloaded to have only 332 .tar files?

Apr 20 '22 08:04 pzhren

@pzhren Some of the links may be invalid, but it should be fine.
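
To see how much data actually arrived, the shards and the images inside them can be counted. This is a rough sketch under assumptions: `count_shards_and_images` is a hypothetical helper, and it assumes the shards are plain .tar files in one folder whose image members end in a common image suffix.

```python
import tarfile
from pathlib import Path


def count_shards_and_images(folder, image_suffixes=(".jpg", ".jpeg", ".png", ".webp")):
    """Return (number of .tar shards, total number of image members across them)."""
    shards = sorted(Path(folder).glob("*.tar"))
    images = 0
    for shard in shards:
        with tarfile.open(shard) as tar:
            # Count only regular files whose name ends with an image suffix.
            images += sum(
                1
                for member in tar
                if member.isfile() and member.name.lower().endswith(image_suffixes)
            )
    return len(shards), images
```

For example, `count_shards_and_images("local_data/gcc3m_shards")` returns the shard count together with the number of downloaded images, which is a more direct measure of dataset size than the number of .tar files alone.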

Apr 21 '22 04:04 xvjiarui

rename has option -d in Ubuntu 20.04

@xvjiarui Sorry for the late reply; I am now getting back to this project. I checked the manual for rename on Ubuntu 20.04 and found no -d option. To make sure we are talking about the same command, could you tell me how you installed rename? In my case, rename was installed with apt install rename. Could you also explain the purpose of the -d option?

Apr 26 '22 02:04 slyviacassell

Hi @slyviacassell

I installed rename by running sudo apt-get install rename

(base) ➜  ~ rename -h
Usage:
    rename [ -h|-m|-V ] [ -v ] [ -0 ] [ -n ] [ -f ] [ -d ]
    [ -e|-E perlexpr]*|perlexpr [ files ]

Options:
    -v, --verbose
            Verbose: print names of files successfully renamed.

    -0, --null
            Use \0 as record separator when reading from STDIN.

    -n, --nono
            No action: print names of files to be renamed, but don't rename.

    -f, --force
            Over write: allow existing files to be over-written.

    --path, --fullpath
            Rename full path: including any directory component. DEFAULT

    -d, --filename, --nopath, --nofullpath
            Do not rename directory: only rename filename component of path.

    -h, --help
            Help: print SYNOPSIS and OPTIONS.

    -m, --man
            Manual: print manual page.

    -V, --version
            Version: show version number.

    -e      Expression: code to act on files name.

            May be repeated to build up code (like "perl -e"). If no -e, the
            first argument is used as code.

    -E      Statement: code to act on files name, as -e but terminated by
            ';'.
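
According to the help text above, -d applies the expression to the filename component only, so 's/^/gcc-train-/' prefixes the file name rather than the whole path. On systems whose rename lacks -d, the same effect could be sketched in Python. This is only an illustration: `add_prefix` is a hypothetical helper, and unlike the rename command it skips directories and files that already carry the prefix.

```python
from pathlib import Path


def add_prefix(folder, prefix="gcc-train-", dry_run=False):
    """Prefix the filename component of every regular file directly inside folder."""
    renamed = []
    # sorted() materializes the listing first, so renames do not disturb iteration.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.name.startswith(prefix):
            continue
        target = path.with_name(prefix + path.name)
        if target.exists():
            raise FileExistsError(str(target))
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Running `add_prefix("local_data/gcc3m_shards", dry_run=True)` first lists the planned renames without touching any file, similar in spirit to rename -n.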

Apr 26 '22 16:04 xvjiarui

Running python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv reports an error:

INFO: Pandarallel will run on 20 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Traceback (most recent call last):
  File "convert_dataset/create_subset.py", line 114, in <module>
    main()
  File "convert_dataset/create_subset.py", line 64, in main
    download_db(files)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/yfcc100m/convert_metadata.py", line 63, in download_db
    total = obj.content_length
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/boto3/resources/factory.py", line 380, in property_loader
    self.load()
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/boto3/resources/factory.py", line 564, in do_action
    response = action(self, *args, **kwargs)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/client.py", line 732, in _make_api_call
    operation_model, request_dict, request_context)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/client.py", line 751, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/endpoint.py", line 107, in make_request
    return self._send_request(request_dict, operation_model)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/endpoint.py", line 180, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/endpoint.py", line 121, in create_request
    operation_name=operation_model.name)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/hooks.py", line 358, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/hooks.py", line 229, in emit
    return self._emit(event_name, kwargs)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/hooks.py", line 212, in _emit
    response = handler(**kwargs)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/signers.py", line 95, in handler
    return self.sign(operation_name, request)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/signers.py", line 167, in sign
    auth.add_auth(request)
  File "/mnt/cephfs/home/rpz/anaconda3/envs/groupvit/lib/python3.7/site-packages/botocore/auth.py", line 401, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

How do you deal with that?

May 03 '22 09:05 pzhren

Hi @pzhren, you need to follow the instructions here and set up your own AWS credentials.
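
Before rerunning the script, it may help to check whether static credentials are visible at all. The sketch below uses only the standard library and covers just two common sources, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and the shared credentials file (by default ~/.aws/credentials). boto3 consults further sources that this sketch ignores, and `find_static_credentials` is a hypothetical helper, not part of boto3.

```python
import configparser
import os
from pathlib import Path


def find_static_credentials(environ=None, credentials_file=None):
    """Return a short description of where credentials were found, or None."""
    env = os.environ if environ is None else environ

    # Source 1: environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    # Source 2: the shared credentials file, honoring the usual overrides.
    path = Path(
        credentials_file
        or env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    ).expanduser()
    profile = env.get("AWS_PROFILE", "default")

    parser = configparser.ConfigParser()
    parser.read(path)  # A missing file is silently ignored.
    if (
        parser.has_section(profile)
        and parser.has_option(profile, "aws_access_key_id")
        and parser.has_option(profile, "aws_secret_access_key")
    ):
        return f"{path} [{profile}]"
    return None
```

A result of None from `find_static_credentials()` suggests that neither of these two sources is set up, which would be consistent with the NoCredentialsError in the traceback. A non-None result only shows that keys are present, not that they are valid.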

May 03 '22 18:05 xvjiarui

I set up my credentials with aws configure, then ran 'python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv', which generated a new error:

May 04 '22 04:05 pzhren

It seems that your credentials are not set up correctly.

May 04 '22 04:05 xvjiarui

It seems that your credentials are not set up correctly.

Is this a problem?

May 04 '22 04:05 pzhren
