Loading local datasets gets strangely stuck
Describe the bug
I'm trying to use load_dataset() to load several local .jsonl files as a dataset. Every line of these files is a JSON object containing a single key, text (yes, it is a dataset for an NLP model). The code snippet is:
ds = load_dataset("json", data_files=LIST_OF_FILE_PATHS, num_proc=16)['train']
However, I found that the loading process can get stuck: the progress bar for Generating train split stops advancing. While trying to find the cause, I noticed a really strange behavior. If I load the dataset file by file like this:
dlist = list()
for _ in LIST_OF_FILE_PATHS:
    dlist.append(load_dataset("json", data_files=_)['train'])
ds = concatenate_datasets(dlist)
I can successfully load all the files, albeit slowly. But if I load them all at once as in the first snippet, things go wrong. I tried to use Ctrl-C to trace the stuck point, but the program cannot be terminated this way when num_proc is set to None; the only thing I can do is suspend it with Ctrl-Z and then kill it. If I use more than 2 CPUs, Ctrl-C simply causes the following error:
^C
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/multiprocess/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/dist-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/usr/local/lib/python3.10/dist-packages/multiprocess/queues.py", line 368, in get
res = self._reader.recv_bytes()
File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 224, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 422, in _recv_bytes
buf = self._recv(4)
File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 387, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
Generating train split: 92431 examples [01:23, 1104.25 examples/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
yield queue.get(timeout=0.05)
File "<string>", line 2, in get
File "/usr/local/lib/python3.10/dist-packages/multiprocess/managers.py", line 818, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 258, in recv
buf = self._recv_bytes()
File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 422, in _recv_bytes
buf = self._recv(4)
File "/usr/local/lib/python3.10/dist-packages/multiprocess/connection.py", line 387, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/data/liyongyuan/source/batch_load.py", line 11, in <module>
a = load_dataset(
File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2133, in load_dataset
builder_instance.download_and_prepare(
File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1842, in _prepare_split
for job_id, done, content in iflatmap_unordered(
File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 770, in get
raise TimeoutError
multiprocess.context.TimeoutError
I have validated the basic correctness of these .jsonl files. They are correctly formatted (otherwise they could not be loaded individually by load_dataset), though some of the JSON lines contain very long text (more than 1e7 characters). I do not know whether this could be the problem. There should also not be any bottleneck in system resources: the whole dataset is ~300 GB, and I am using a cloud server with plenty of storage and 1 TB of RAM.
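For reference, a hypothetical sketch of that kind of validation (the check_jsonl helper below is not from the thread, just an illustration of the checks described above):

```python
import json

# Hypothetical helper (not from the thread): verify that each line parses as
# JSON with a single "text" key, and flag lines longer than 1e7 characters,
# which the report suspects may be relevant to the hang.
def check_jsonl(path, max_len=10_000_000):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)        # raises on malformed JSON
            assert set(record) == {"text"}   # only the "text" key is expected
            if len(line) > max_len:
                print(f"{path}: line {lineno} is very long ({len(line)} chars)")
```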
Thanks for your efforts and patience! Any suggestion or help would be appreciated.
Steps to reproduce the bug
- Use load_dataset() with data_files=LIST_OF_FILES
Expected behavior
All the files should load smoothly.
Environment info
- Dataset: a private dataset of ~2500 .jsonl files, ~300 GB in total. Each JSON object contains only one key: text. Format checked.
- datasets version: 2.14.2
- Platform: Linux-4.19.91-014.kangaroo.alios7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.6
- Huggingface_hub version: 0.15.1
- PyArrow version: 10.0.1.dev0+ga6eabc2b.d20230609
- Pandas version: 1.5.2
Yesterday I waited for more than 12 hours to make sure it was really stuck rather than just proceeding very slowly.
I've had similarly weird issues with load_dataset as well. Not with multiple files, but the dataset is quite big, about 50 GB.
We use generic multiprocessing code, so there is little we can do about this; unfortunately, turning off multiprocessing seems to be the only solution. Multithreading would make our code easier to maintain and (most likely) avoid issues such as this one, but we cannot use it until the GIL is dropped (no-GIL Python should be released in 2024, so we can start exploring it then). A minimal sketch of that single-process fallback is below.
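A minimal sketch of the fallback, assuming the same json builder call as in the original report (LIST_OF_FILE_PATHS is the reporter's placeholder):

```python
from datasets import load_dataset

# Single-process fallback suggested above: leave num_proc unset (or pass
# num_proc=None) so the train split is generated in the main process.
# LIST_OF_FILE_PATHS is the reporter's placeholder for the local .jsonl files.
ds = load_dataset("json", data_files=LIST_OF_FILE_PATHS)["train"]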
The problem seems to be the Generating train split step. Is it possible to avoid it? I have a dataset saved and just want to load it, but I somehow keep running into issues with that again.
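Not an answer given in the thread, but one common pattern with the datasets API for skipping split generation on later runs is to save the prepared dataset as Arrow files and reload them directly; a sketch with illustrative placeholder paths:

```python
from datasets import load_dataset, load_from_disk

# Prepare the dataset once and save the resulting Arrow files to disk.
# "/path/to/arrow_dataset" is an illustrative placeholder.
ds = load_dataset("json", data_files=LIST_OF_FILE_PATHS)["train"]
ds.save_to_disk("/path/to/arrow_dataset")

# Later runs reload the saved Arrow files directly, without re-parsing the
# JSON and without the "Generating train split" step.
ds = load_from_disk("/path/to/arrow_dataset")
```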
Hey guys, I recently ran into this problem again and spent a whole day trying to locate it. I finally found that the problem seems to lie with pyarrow's JSON parser, and it appears to be a long-standing one; a similar issue can be found in #2181. Anyway, my solution is to adjust load_dataset's chunksize parameter. You can inspect the default set in datasets/packaged_modules/json/json.py; the actual chunksize is quite small, and you can increase the value. For me, chunksize=10 << 23 solved the stuck problem. But I also found that too big a chunksize, like 10 << 30, also causes a hang, which is rather weird. I may explore this when I am free. I hope this helps those who encounter the same problem.
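For reference, a sketch of how that workaround can be passed through load_dataset, assuming the json builder accepts the chunksize config parameter as in datasets 2.14 (LIST_OF_FILE_PATHS is the reporter's placeholder):

```python
from datasets import load_dataset

# Increase the block size handed to pyarrow's JSON reader; 10 << 23 bytes
# (roughly 80 MB) worked for the commenter above, while very large values
# such as 10 << 30 reportedly hang again.
ds = load_dataset(
    "json",
    data_files=LIST_OF_FILE_PATHS,
    num_proc=16,
    chunksize=10 << 23,
)["train"]
```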
Experiencing the same issue with the kaist-ai/Feedback-Collection dataset, which is comparatively small, i.e. ~100k rows.
Code to reproduce
from datasets import load_dataset
dataset = load_dataset("kaist-ai/Feedback-Collection")
I have tried setting num_proc=1 as well as chunksize=1024 and 64, but the problem persists. Any pointers?