
Loading JSON gets stuck with many workers/threads

lvwerra opened this issue on Feb 11, 2022

Describe the bug

Loading a JSON dataset with load_dataset can get stuck when running on a machine with many CPUs. This is especially an issue when loading a large dataset on a large machine.

Steps to reproduce the bug

I originally created the following script to reproduce the issue:

from datasets import load_dataset
from multiprocessing import Process
from tqdm import tqdm
import datasets
from transformers import set_seed

def run_tasks_in_parallel(tasks, ds_list):
    # Repeatedly spawn one process per (task, dataset) pair and wait for all of them.
    for _ in tqdm(range(1000)):
        print('new batch')
        running_tasks = [Process(target=task, args=(ds, i)) for i, (task, ds) in enumerate(zip(tasks, ds_list))]
        for running_task in running_tasks:
            running_task.start()
        for running_task in running_tasks:
            running_task.join()

def get_dataset():
    # Load the dataset in streaming mode and return an iterator over its shuffled elements.
    dataset_name = 'transformersbook/codeparrot'
    ds = load_dataset(dataset_name + '-train', split="train", streaming=True)
    ds = ds.shuffle(buffer_size=1000, seed=1)
    return iter(ds)

def get_next_element(ds, process_id, N=10000):
    # Consume N elements from the dataset iterator.
    for _ in range(N):
        _ = next(ds)['content']
    print(f'process {process_id} done')
    return

set_seed(1)
datasets.utils.logging.set_verbosity_debug()

n_processes = 8
tasks = [get_next_element for _ in range(n_processes)]
args = [get_dataset() for _ in range(n_processes)]
run_tasks_in_parallel(tasks, args)

Today I noticed that it can also happen when running in a single process on a machine with many cores, without streaming. So just load_dataset("transformersbook/codeparrot-train") alone might cause the issue after waiting long enough or trying enough times. The behaviour is somewhat random, which makes it especially hard to track down. When I encountered it today, it had already processed 17GB of data (the size of the cache folder when it got stuck) before hanging.

Here's my current understanding of the error. As far as I can tell it happens in the following block: https://github.com/huggingface/datasets/blob/be701e9e89ab38022612c7263edc015bc7feaff9/src/datasets/packaged_modules/json/json.py#L119-L139

When the try on line 121 fails and block_size is increased, it can happen that the JSON can't be read again and the process gets stuck indefinitely. A hint pointing in that direction is that increasing the chunksize argument decreases the chance of getting stuck, and vice versa. Maybe it is an issue with a lock on the file that is not properly released.
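
For context, the retry logic in that block looks roughly like this. This is a simplified sketch of the idea, not the exact source (the real code also inspects the error message and handles batching):

import io

import pyarrow as pa
import pyarrow.json as paj

def parse_batch(batch: bytes, block_size: int) -> pa.Table:
    # Retry parsing with a doubled block_size until pyarrow succeeds
    # or the block size exceeds the size of the batch itself.
    while True:
        try:
            return paj.read_json(
                io.BytesIO(batch),
                read_options=paj.ReadOptions(block_size=block_size),
            )
        except pa.ArrowInvalid:
            if block_size > len(batch):
                raise
            block_size *= 2  # give pyarrow a larger block and try again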

Expected results

Read a JSON before the end of the universe.

Actual results

Read a JSON not before the end of the universe.

Environment info

  • datasets version: 1.18.3
  • Platform: Linux-4.19.0-18-cloud-amd64-x86_64-with-glibc2.28
  • Python version: 3.9.10
  • PyArrow version: 7.0.0

@lhoestq we discussed this a while ago. @albertvillanova we discussed this today :)

lvwerra avatar Feb 11 '22 18:02 lvwerra

Hi! Note that it does block_size *= 2 until block_size > len(batch), so it doesn't loop indefinitely. What do you mean by "gets stuck indefinitely" then? Is it the actual call to paj.read_json that hangs?

increasing the chunksize argument decreases the chance of getting stuck

Could you share the values of chunksize that you're using when you observe this? And maybe the order of magnitude of the number of bytes per JSON line?

lhoestq avatar Feb 11 '22 20:02 lhoestq

To clarify, I don't think it loops indefinitely; rather, the paj.read_json call gets stuck after the first try. That's why I think it could be an issue with a lock somewhere.

Using load_dataset(..., chunksize=40<<20) worked without errors.
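
For reference, the workaround amounts to passing a larger chunksize through load_dataset to the JSON builder. A minimal sketch (the data_files path below is a placeholder):

from datasets import load_dataset

# Read the raw files in 40 MiB chunks instead of the default 10 MiB,
# which makes it less likely that pyarrow has to retry with larger block sizes.
ds = load_dataset(
    "json",
    data_files="path/to/*.jsonl",  # placeholder path
    split="train",
    chunksize=40 << 20,
)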

lvwerra avatar Feb 11 '22 20:02 lvwerra

@lhoestq I encountered another related issue. I use load_dataset() for my JSON data and set_transform() for preprocessing, but it hangs at the end of the epoch if dataloader_num_workers>=1. It appears to work fine with num_workers=0, but that's slow.

train_dataset = datasets.load_dataset("json", 
                                      data_files=corpus_jsonl_path,
                                      keep_in_memory=False,
                                      cache_dir=model_args.cache_dir,
                                      streaming=False)
train_dataset.set_transform(psg_parse_fn)
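
For context, set_transform registers a function that is applied on the fly to each accessed batch of rows (for example when a DataLoader fetches items). A minimal, self-contained sketch of the pattern; the body of psg_parse_fn below is hypothetical, since the original isn't shown in the thread:

import datasets
from torch.utils.data import DataLoader

ds = datasets.Dataset.from_dict({"text": ["doc one", "doc two", "doc three", "doc four"]})

def psg_parse_fn(batch):
    # Stand-in transform: receives a batch as a dict of lists and returns a dict of lists.
    return {"length": [len(t.split()) for t in batch["text"]]}

ds.set_transform(psg_parse_fn)

loader = DataLoader(ds, batch_size=2)
for batch in loader:
    print(batch["length"])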

memray avatar Jul 30 '22 09:07 memray

I think your problem is unrelated to this issue @memray. Indeed, this issue discusses a bug when doing load_dataset, while your case has to do with the DataLoader in a multiprocessing setup. Can you open a new issue and provide more details (share your env and what psg_parse_fn does)?

lhoestq avatar Aug 17 '22 14:08 lhoestq

I also encountered a similar issue when loading a 190GB dataset of JSONL files (255 files, each under 1GB): it got stuck for over 20h at table generation (figure below). Increasing the chunksize with load_dataset(..., chunksize=40<<20) fixed the issue.

[screenshot: progress bar stalled during table generation]

loubnabnl avatar Oct 13 '22 13:10 loubnabnl

Following up on my earlier comment above (load_dataset() for JSON data with set_transform(), hanging at the end of the epoch when dataloader_num_workers>=1):

In case people also get this problem, I found a way to fix it by adding persistent_workers=True when initializing the DataLoader, like:

train_loader = DataLoader(
    train_dataset,
    batch_size=self._train_batch_size,
    sampler=train_sampler,
    collate_fn=data_collator,
    num_workers=self.args.dataloader_num_workers,
    persistent_workers=True,
)

The error was CUDA error: initialization error (Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:1266) after the 1st epoch. I guess it's because the DataLoader workers are killed after each epoch and the data supply is cut off. This error only occurs when num_workers>1.

memray avatar Jan 17 '23 18:01 memray

I can confirm the issue using datasets (2.12.0) and Accelerate (0.20.3), with the following code and environment:

trainDataloader = DataLoader(trainSplit, batch_size=args.train_batch_size, shuffle=True)
evalDataloader = DataLoader(validSplit, batch_size=args.valid_batch_size)  # Here is where it gets stuck.
- `Accelerate` version: 0.20.3
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- System RAM: 503.28 GB
- GPU type: Tesla V100-SXM2-32GB
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: 0,1
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

Notably, with Accelerate configured for a single GPU only, it doesn't get stuck.

The suggestion made by @memray worked in my case. This is how it was applied:

trainDataloader = DataLoader(trainSplit, batch_size=args.train_batch_size, shuffle=True, num_workers=2, persistent_workers=True)
evalDataloader = DataLoader(validSplit, batch_size=args.valid_batch_size, num_workers=2, persistent_workers=True)

mvasiliniuc avatar Jun 16 '23 09:06 mvasiliniuc

I think your issue is related to accelerate, feel free to open an issue there: https://github.com/huggingface/accelerate/issues

Dataset objects generally work fine with the torch DataLoader, idk what accelerate does that could make it get stuck.
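
For reference, a minimal sketch of a datasets.Dataset used directly with a torch DataLoader, without Accelerate (the columns here are illustrative):

from datasets import Dataset
from torch.utils.data import DataLoader

# Tiny in-memory dataset; with_format("torch") makes numeric columns come back as torch tensors.
ds = Dataset.from_dict({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
ds = ds.with_format("torch")

loader = DataLoader(ds, batch_size=2, num_workers=2, persistent_workers=True)
for batch in loader:
    print(batch["label"])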

lhoestq avatar Jun 16 '23 11:06 lhoestq