FileNotFoundError while downloading DataComp-1B
Thanks for the great work. I encountered the following issue while downloading the DataComp-1B dataset:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Hi @xfgao!
To help identify the issue, could you share the command with which you are running download_upstream.py? Also, is the above the full error message?
Thanks!
We were running the following command to download data:
python download_upstream.py --scale datacomp_1b --data_dir DATA_DIR
We were able to download all the metadata and a bunch of tar files at the beginning, but after a certain point we keep getting the error message:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
The error seems to be an OS-level limit on the number of threads you can open.
Either decrease the number of threads you are using, increase the limit, or use a different machine.
If you provide some info on your environment, that could help.
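Not something from the repo, just a quick sketch of how you could check (and, where allowed, raise) that per-user limit from Python on Linux:

import os
import resource

# RLIMIT_NPROC caps the number of processes/threads the current user may create (Linux only).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")
print(f"CPU cores: {os.cpu_count()}")

# An unprivileged process may raise its soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
print("Raised soft limit to the hard limit.")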
Thanks for the response. For the data downloading, I'm using Ubuntu 20.04 on an AWS EC2 g5.12xlarge instance (with 48 CPU cores). After reducing processes_count to 8 and thread_count to 8, I'm still getting the same FileNotFoundError.
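For reference, the reduced run looked roughly like this (I'm assuming the flag names here; they mirror the img2dataset parameter names, so check the script's --help output):

python download_upstream.py --scale datacomp_1b --data_dir DATA_DIR --processes_count 8 --thread_count 8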
Can you try using a virtual env instead of conda?
Do we have a requirements.txt file for setting up a virtual env?
You can try installing the packages listed under pip in the environment.yml. If I am not mistaken, that should achieve something similar to the desired environment, provided your system Python is the correct version (although this needs to be verified). You should still train with the original environment to avoid other issues, but for just the data download it should be fine.
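If it helps, here is a small sketch (my own, not part of the repo) that pulls the pip section out of environment.yml into a requirements file, assuming the usual conda layout where pip dependencies live in a nested {"pip": [...]} entry under "dependencies":

import yaml  # requires PyYAML

with open("environment.yml") as f:
    env = yaml.safe_load(f)

# Collect the pip dependencies nested inside the conda "dependencies" list.
pip_deps = []
for dep in env.get("dependencies", []):
    if isinstance(dep, dict) and "pip" in dep:
        pip_deps.extend(dep["pip"])

with open("requirements.txt", "w") as f:
    f.write("\n".join(pip_deps) + "\n")
print(f"Wrote {len(pip_deps)} pip requirements to requirements.txt")

From there, pip install -r requirements.txt inside the activated virtual env should get you close.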