img2dataset

Can I try downloading LAION400M with multiple PCs?

Open • sunggukcha opened this issue 1 year ago • 1 comment

Hello @rom1504.

First of all, thank you sincerely for your great contribution to the community.

My question is: can I download LAION400M using multiple PCs? I am asking because my PC has limited bandwidth.

I am currently trying to distribute the download by dividing laion400m-meta into f"laion400m-meta{x}" for x in range(8), where each meta{x} directory holds 4 parquet files along with the index.html file. When I run the provided command (https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion400m.md#download-the-images-with-img2dataset) with the url_list path replaced, some shards fail with an error saying that the shard has no feather file (e.g., shards 1502, 1496, ...).

Am I doing this right?
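For concreteness, the split described above can be sketched as follows. This is only an illustration under stated assumptions: `split_metadata` is a hypothetical helper (not part of img2dataset), the parquet file names are assumed to sort lexicographically in download order, and the files are copied rather than moved.

```python
from pathlib import Path
import shutil


def split_metadata(src_dir, dst_root, num_splits=8):
    """Copy the parquet files in src_dir into num_splits folders of equal size.

    Hypothetical helper: folder names follow the f"laion400m-meta{x}" pattern
    from the post; files are assigned in contiguous, sorted chunks.
    """
    files = sorted(Path(src_dir).glob("*.parquet"))
    per_split, remainder = divmod(len(files), num_splits)
    if remainder != 0:
        raise ValueError(
            f"{len(files)} files cannot be split evenly into {num_splits} parts"
        )

    out_dirs = []
    for x in range(num_splits):
        out_dir = Path(dst_root) / f"laion400m-meta{x}"
        out_dir.mkdir(parents=True, exist_ok=True)
        # Files x*per_split .. (x+1)*per_split - 1 go to split x.
        for f in files[x * per_split:(x + 1) * per_split]:
            shutil.copy2(f, out_dir / f.name)
        out_dirs.append(out_dir)
    return out_dirs
```

With 32 parquet files and `num_splits=8`, each folder receives 4 files, and each folder can then be passed as the url_list path on a different machine.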

Best regards,

Sungguk Cha

sunggukcha • Mar 31 '23 07:03

From what I have observed, the error above happens during cleanup, when a temporary shard file under _tmp is being removed.

d_shard
    fs.rm(shard_path)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 169, in rm
    os.remove(p)
FileNotFoundError: [Errno 2] No such file or directory: '/data-sets/LAION400M/laion400m-data/_tmp/2070.feather'
shard 2070 failed with error [Errno 2] No such file or directory: '/data-sets/LAION400M/laion400m-data/_tmp/2070.feather'
595it [1:18:01,  5.19s/it]Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/img2dataset/downloader.py", line 128, in __call__

I think this error may not affect the download result.

However, let's say I have split the 32 metadata files into 8 splits (files 0, 1, 2, 3 go to split 0; files 4, 5, 6, 7 go to split 1; ...). The first parquet file of each split finds no shard and is skipped. Is this okay?
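For reference, the exception in the traceback is the one `os.remove` raises when the target path no longer exists. The sketch below reproduces that behaviour with the standard library only. `remove_if_present` is a hypothetical helper for illustration, not something img2dataset or fsspec provides, and the sketch makes no claim about why the .feather file was missing here.

```python
import os
import tempfile


def remove_if_present(path):
    """Delete path; return True if it was removed, False if it was already gone."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Same exception type as in the traceback: the file did not exist.
        return False


# Usage: the second call finds nothing to delete and reports False
# instead of raising.
fd, tmp_path = tempfile.mkstemp(suffix=".feather")
os.close(fd)
first = remove_if_present(tmp_path)
second = remove_if_present(tmp_path)
```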

sunggukcha • Mar 31 '23 08:03