
Distributed data parallel training for streaming datasets

cyk1337 opened this issue

Feature request

Is there any documentation on using load_dataset(streaming=True) for (multi-node, multi-GPU) DDP training?

Motivation

Given a set of data files, they are expected to be split across the different GPUs. Is there a guide or documentation for this?

Your contribution

Does it require manually splitting the data files for each worker in DatasetBuilder._split_generator()? What is IterableDatasetShard expected to do?

cyk1337 · Jul 17 '22

Hi ! According to https://huggingface.co/docs/datasets/use_with_pytorch#stream-data you can use the PyTorch DataLoader with num_workers > 0 to distribute the shards across your workers (it uses torch.utils.data.get_worker_info() to get the worker ID and select the right subset of shards to use).

EDIT: here is a code example

from torch.utils.data import DataLoader

ds = ds.with_format("torch")
dataloader = DataLoader(ds, num_workers=num_workers)

EDIT: with_format("torch") is not required; now you can just do:

dataloader = DataLoader(ds, num_workers=num_workers)
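
A slightly fuller sketch of the same idea, assuming a streaming dataset built from multiple data files (the file names and num_workers value below are only placeholders):

from datasets import load_dataset
from torch.utils.data import DataLoader

# hypothetical shard files; with num_workers > 0 each DataLoader worker
# streams a different subset of these shards (get_worker_info() is used internally)
data_files = [f"data/shard-{i:04d}.jsonl" for i in range(8)]
ds = load_dataset("json", data_files=data_files, split="train", streaming=True)

dataloader = DataLoader(ds, num_workers=4)
for batch in dataloader:
    ...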

lhoestq · Jul 25 '22

@cyk1337 does streaming datasets with multi-GPU work for you? I am testing on one node with multiple GPUs, but it freezes (https://github.com/huggingface/datasets/issues/5123). In case you could make this work, could you share your data-loading code with me? Thank you.

jackfeinmann5 · Oct 24 '22

+1

Mohammed20201991 · Feb 28 '23

This has been implemented in datasets 2.8:

from datasets.distributed import split_dataset_by_node

ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

docs: https://huggingface.co/docs/datasets/use_with_pytorch#distributed
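
A fuller sketch of how this might look in a DDP script, assuming the rank and world size come from the environment variables set by torchrun (the data files here are placeholders):

import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

rank = int(os.environ["RANK"])              # set by torchrun / the DDP launcher
world_size = int(os.environ["WORLD_SIZE"])

ds = load_dataset("json", data_files="data/*.jsonl", split="train", streaming=True)

# each rank streams its own subset of the shards; if ds.n_shards is not a
# multiple of world_size, each rank instead keeps 1 example out of world_size
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

dataloader = DataLoader(ds, num_workers=4)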

lhoestq · Feb 28 '23

I'm having hanging issues with this when using DDP and allocating the datasets with split_dataset_by_node 🤔


Edit:

I don't want to pollute this thread, but for the sake of following up: I observed hanging close to the final iteration of the dataloader, and I think it was happening on the final shard. First, I removed the final shard and things worked. Then, including all shards again, I reordered the list of shards, load_dataset('json', data_files=reordered, streaming=True), and there was no hang.

I won't open an issue yet because I am not quite sure about this observation.
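
(For concreteness, the workaround described above was along these lines; the paths and the particular reordering are only placeholders:)

import glob

from datasets import load_dataset

files = sorted(glob.glob("data/*.jsonl"))   # placeholder shard paths
reordered = files[1:] + files[:1]           # some permutation of the shard list
ds = load_dataset("json", data_files=reordered, streaming=True)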

wconnell · Apr 26 '23

@wconnell would you mind opening a different bug issue and giving more details? https://github.com/huggingface/datasets/issues/new?assignees=&labels=&template=bug-report.yml

Thanks.

albertvillanova · Apr 26 '23