img2dataset
Can I download LAION400M using multiple PCs?
Hello @rom1504.
First of all, thank you sincerely for your great contribution to the community.
My question is: can I download LAION400M using multiple PCs?
I ask because my PC has limited bandwidth.
I am currently trying to distribute the download by dividing laion400m-meta
into f"laion400m-meta{x}" for x in range(8),
where each meta{x} contains 4 parquet files along with the index.html file.
I ran the provided command (https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion400m.md#download-the-images-with-img2dataset), replacing only the url_list path,
and some shards failed with an error saying that the shard has no feather file (e.g., shard 1502, shard 1496, ...).
Am I doing this right?
Best regards,
Sungguk Cha
From what I can see, the error above is raised during cleanup, when the shard's temporary feather file in _tmp is removed:
d_shard
fs.rm(shard_path)
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 169, in rm
os.remove(p)
FileNotFoundError: [Errno 2] No such file or directory: '/data-sets/LAION400M/laion400m-data/_tmp/2070.feather'
shard 2070 failed with error [Errno 2] No such file or directory: '/data-sets/LAION400M/laion400m-data/_tmp/2070.feather'
595it [1:18:01, 5.19s/it]Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/img2dataset/downloader.py", line 128, in __call__
I think this error may not affect the download result.
However, let's say I have split the 32 metadata files into 8 splits (files 0, 1, 2, 3 go to split 0, files 4, 5, 6, 7 go to split 1, ...). The first parquet file of each split finds no shards and is skipped. Is this okay?
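For reference, the split described above can be sketched roughly as follows. This is only a minimal illustration, not part of img2dataset: the helper names `group_files` and `split_metadata` are hypothetical, and it assumes the parquet file names sort lexicographically in the intended order (for example, zero-padded part numbers).

```python
from pathlib import Path
import shutil


def group_files(files, group_size=4):
    """Split a sorted list of files into consecutive groups of group_size."""
    files = sorted(files)
    return [files[i:i + group_size] for i in range(0, len(files), group_size)]


def split_metadata(src_dir, dst_root, group_size=4):
    """Copy each group of parquet files into its own laion400m-meta{x} folder.

    src_dir and dst_root are placeholder paths chosen for this sketch.
    """
    parquet_files = sorted(Path(src_dir).glob("*.parquet"))
    for x, group in enumerate(group_files(parquet_files, group_size)):
        dst = Path(dst_root) / f"laion400m-meta{x}"
        dst.mkdir(parents=True, exist_ok=True)
        for f in group:
            # copy2 keeps file metadata; the originals are left untouched
            shutil.copy2(f, dst / f.name)
```

With 32 parquet files and a group size of 4, this produces 8 folders, each of which would then be passed as the url_list path on a different PC.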