OSError: [Errno 24] Too many open files
Describe the bug
I am trying to load the 'default' subset of the following dataset, which contains many files (828 per split): https://huggingface.co/datasets/mteb/biblenlp-corpus-mmteb
When I try to load it with the load_dataset function, I get the following error:
>>> from datasets import load_dataset
>>> d = load_dataset('mteb/biblenlp-corpus-mmteb')
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████| 201k/201k [00:00<00:00, 1.07MB/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 1069.15it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 436182.33it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 2228.75it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 646478.73it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 831032.24it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 517645.51it/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:33<00:00, 24.87files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 27.48files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 26.94files/s]
Generating train split: 1571592 examples [00:03, 461438.97 examples/s]
Generating test split: 11163 examples [00:00, 118190.72 examples/s]
Traceback (most recent call last):
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1995, in _prepare_split_single
for _, table in generator:
File ".env/lib/python3.12/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
with open(file, "rb") as f:
^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/datasets/streaming.py", line 75, in wrapper
return function(*args, download_config=download_config, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 1224, in xopen
file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
return self.__enter__()
^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
f = self.fs.open(self.path, mode=mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
f = self._open(
^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/datasets/filesystems/compression.py", line 81, in _open
return self.file.open()
^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
return self.__enter__()
^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
f = self.fs.open(self.path, mode=mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
f = self._open(
^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 197, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 322, in __init__
self._open()
File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 327, in _open
self.f = open(self.path, mode=self.mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/downloads/3a347186abfc0f9c924dde0221d246db758c7232c0101523f04a87c17d696618'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 981, in incomplete_dir
yield tmp_dir
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1027, in download_and_prepare
self._download_and_prepare(
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1882, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 2038, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".env/lib/python3.12/site-packages/datasets/load.py", line 2609, in load_dataset
builder_instance.download_and_prepare(
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1007, in download_and_prepare
with incomplete_dir(self._output_dir) as tmp_output_dir:
File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
self.gen.throw(value)
File ".env/lib/python3.12/site-packages/datasets/builder.py", line 988, in incomplete_dir
shutil.rmtree(tmp_dir)
File "/usr/lib/python3.12/shutil.py", line 785, in rmtree
_rmtree_safe_fd(fd, path, onexc)
File "/usr/lib/python3.12/shutil.py", line 661, in _rmtree_safe_fd
onexc(os.scandir, path, err)
File "/usr/lib/python3.12/shutil.py", line 657, in _rmtree_safe_fd
with os.scandir(topfd) as scandir_it:
^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/mteb___biblenlp-corpus-mmteb/default/0.0.0/3912ed967b0834547f35b2da9470c4976b357c9a.incomplete'
I checked the maximum number of open files on my machine (Ubuntu 24.04) and it appears to be 1024. However, even when I try to load a single split (load_dataset('mteb/biblenlp-corpus-mmteb', split='train')), I get the same error.
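For anyone wanting to confirm the limit from inside the Python process itself, here is a minimal sketch using the standard-library resource module (Unix only). The helper name open_file_limits is hypothetical and is not part of datasets.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process.

    Hypothetical helper for illustration; it only wraps resource.getrlimit.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# The soft limit is the one that triggers Errno 24 when it is exceeded.
print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is the value that ulimit -n reports in the shell that launched Python.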
Steps to reproduce the bug
from datasets import load_dataset
d = load_dataset('mteb/biblenlp-corpus-mmteb')
Expected behavior
Load the dataset without error
Environment info
- datasets version: 2.19.0
- Platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- huggingface_hub version: 0.23.0
- PyArrow version: 16.0.0
- Pandas version: 2.2.2
- fsspec version: 2024.3.1
ulimit -n 8192 can solve this problem
Would there be a systematic way to do this? The data loading is part of the MTEB library.
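One possible way to apply the ulimit workaround from inside Python, rather than in the shell, is to raise the soft limit with the standard-library resource module before calling load_dataset. This is only a sketch under assumptions: the helper name raise_open_file_limit and the default target of 8192 are illustrative, it works on Unix only, and it cannot go above the hard limit set by the system.

```python
import resource


def raise_open_file_limit(target=8192):
    """Try to raise the soft open-file limit to `target`; return the resulting soft limit.

    Hypothetical helper: the soft limit can only be raised up to the hard limit,
    so the target is capped accordingly.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to do
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the OS refused the change; keep the current limit
    return target


new_soft = raise_open_file_limit()
print(f"soft open-file limit is now {new_soft}")
```

Calling this once at program start, before load_dataset, affects only the current process and its children, so it does not require changing system-wide settings.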
I think we could modify the _prepare_split_single function
I fixed it with https://github.com/huggingface/datasets/pull/6893, feel free to re-open if you're still having the issue :)
Thanks a lot!