OSError: [Errno 24] Too many open files

loicmagne opened this issue 9 months ago • 3 comments

Describe the bug

I am trying to load the 'default' subset of the following dataset, which contains a large number of files (828 per split): https://huggingface.co/datasets/mteb/biblenlp-corpus-mmteb

When trying to load it with the load_dataset function, I get the following error:

>>> from datasets import load_dataset
>>> d = load_dataset('mteb/biblenlp-corpus-mmteb')
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████| 201k/201k [00:00<00:00, 1.07MB/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 1069.15it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 436182.33it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 2228.75it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 646478.73it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 831032.24it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████| 828/828 [00:00<00:00, 517645.51it/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:33<00:00, 24.87files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 27.48files/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████| 828/828 [00:30<00:00, 26.94files/s]
Generating train split: 1571592 examples [00:03, 461438.97 examples/s]
Generating test split: 11163 examples [00:00, 118190.72 examples/s]
Traceback (most recent call last):
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1995, in _prepare_split_single
    for _, table in generator:
  File ".env/lib/python3.12/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/streaming.py", line 75, in wrapper
    return function(*args, download_config=download_config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 1224, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/datasets/filesystems/compression.py", line 81, in _open
    return self.file.open()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 197, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 322, in __init__
    self._open()
  File ".env/lib/python3.12/site-packages/fsspec/implementations/local.py", line 327, in _open
    self.f = open(self.path, mode=self.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/downloads/3a347186abfc0f9c924dde0221d246db758c7232c0101523f04a87c17d696618'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 981, in incomplete_dir
    yield tmp_dir
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1882, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 2038, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".env/lib/python3.12/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 1007, in download_and_prepare
    with incomplete_dir(self._output_dir) as tmp_output_dir:
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File ".env/lib/python3.12/site-packages/datasets/builder.py", line 988, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/usr/lib/python3.12/shutil.py", line 785, in rmtree
    _rmtree_safe_fd(fd, path, onexc)
  File "/usr/lib/python3.12/shutil.py", line 661, in _rmtree_safe_fd
    onexc(os.scandir, path, err)
  File "/usr/lib/python3.12/shutil.py", line 657, in _rmtree_safe_fd
    with os.scandir(topfd) as scandir_it:
         ^^^^^^^^^^^^^^^^^
OSError: [Errno 24] Too many open files: '.cache/huggingface/datasets/mteb___biblenlp-corpus-mmteb/default/0.0.0/3912ed967b0834547f35b2da9470c4976b357c9a.incomplete'


I looked up the maximum number of open files on my machine (Ubuntu 24.04) and it seems to be 1024. However, even when I try to load a single split (load_dataset('mteb/biblenlp-corpus-mmteb', split='train')), I get the same error.
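
For reference, the effective limit can also be checked from inside Python (a minimal sketch; RLIMIT_NOFILE is the per-process open-file limit the loader runs into):

from resource import getrlimit, RLIMIT_NOFILE

soft, hard = getrlimit(RLIMIT_NOFILE)
print(soft, hard)  # e.g. a soft limit of 1024 on a default Ubuntu setup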

Steps to reproduce the bug

from datasets import load_dataset
d = load_dataset('mteb/biblenlp-corpus-mmteb')

Expected behavior

Load the dataset without error

Environment info

  • datasets version: 2.19.0
  • Platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 0.23.0
  • PyArrow version: 16.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1

loicmagne · May 07 '24 01:05

ulimit -n 8192 can solve this problem

arthasking123 · May 07 '24 03:05

ulimit -n 8192 can solve this problem

Would there be a systematic way to do this? The data loading is part of the MTEB library.

loicmagne · May 07 '24 09:05
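
A programmatic alternative to running ulimit manually (a hedged sketch of what a library could do before loading, not something datasets or MTEB does out of the box; raise_nofile_limit is a hypothetical helper name) is to raise the soft limit up to the hard limit allowed by the OS:

import resource

def raise_nofile_limit(target=8192):  # hypothetical helper, for illustration
    # Raise the soft RLIMIT_NOFILE toward `target`, never above the hard limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

raise_nofile_limit()
# then: load_dataset('mteb/biblenlp-corpus-mmteb')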

ulimit -n 8192 can solve this problem

Would there be a systematic way to do this? The data loading is part of the MTEB library.

I think we could modify the _prepare_split_single function

arthasking123 · May 08 '24 10:05
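
For illustration only, the general direction of such a change would be to ensure each data file is closed before the next one is opened, so the builder never holds hundreds of descriptors at once. A rough, hypothetical sketch of that pattern (not the actual datasets code):

import gzip
import json

def iter_json_lines(paths):
    # One file handle at a time: the context manager closes the current
    # file before the loop moves on to the next path.
    for path in paths:
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)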

I fixed it with https://github.com/huggingface/datasets/pull/6893, feel free to re-open if you're still having the issue :)

lhoestq · May 13 '24 13:05

I fixed it with #6893, feel free to re-open if you're still having the issue :)

Thanks a lot!

loicmagne · May 13 '24 15:05