Distributed data parallel training for streaming datasets
Feature request
Is there any documentation on using load_dataset(streaming=True) for (multi-node, multi-GPU) DDP training?
Motivation
Given a set of data files, they should be split across the different GPUs. Is there a guide or documentation for this?
Your contribution
Does it require manually splitting the data files for each worker in DatasetBuilder._split_generator()? What is IterableDatasetShard expected to do?
Hi! According to https://huggingface.co/docs/datasets/use_with_pytorch#stream-data you can use the PyTorch DataLoader with num_workers > 0 to distribute the shards across your workers (it uses torch.utils.data.get_worker_info() to get the worker ID and select the right subset of shards to use).
EDIT: here is a code example

from torch.utils.data import DataLoader

ds = ds.with_format("torch")
dataloader = DataLoader(ds, num_workers=num_workers)  # each worker streams its own subset of shards
EDIT: with_format("torch") is not required; now you can just do:

dataloader = DataLoader(ds, num_workers=num_workers)
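(For illustration, here is a minimal sketch of the per-worker shard-selection idea; shards_for_this_worker is a hypothetical helper, and the actual library code differs.)

from torch.utils.data import get_worker_info

def shards_for_this_worker(all_shards):
    # hypothetical helper illustrating how shards can be split across DataLoader workers
    info = get_worker_info()
    if info is None:
        # single-process data loading: this process handles all shards
        return all_shards
    # round-robin split: worker i takes shards i, i + num_workers, i + 2*num_workers, ...
    return all_shards[info.id::info.num_workers]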
@cyk1337 do streaming datasets work for you with multiple GPUs? I am testing on one node with multiple GPUs, but it freezes (https://github.com/huggingface/datasets/issues/5123). If you got this working, could you share your data-loading code? Thank you!
+1
This has been implemented in datasets 2.8:
from datasets.distributed import split_dataset_by_node
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
docs: https://huggingface.co/docs/datasets/use_with_pytorch#distributed
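For reference, here is a minimal end-to-end sketch, assuming a torchrun-style launch (which provides the rank/world size via environment variables) and hypothetical shard paths:

import torch.distributed as dist
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

dist.init_process_group(backend="nccl")  # rank/world size come from the launcher's env vars

# hypothetical shard paths; split_dataset_by_node gives each rank a disjoint subset
ds = load_dataset("json", data_files="data/shard-*.jsonl", streaming=True, split="train")
ds = split_dataset_by_node(ds, rank=dist.get_rank(), world_size=dist.get_world_size())

dataloader = DataLoader(ds, batch_size=8, num_workers=4)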
I'm having hanging issues with this when using DDP and allocating the datasets with split_dataset_by_node 🤔
edit
I don't want to pollute this thread, but for the sake of following up: I observed hanging close to the final iteration of the dataloader. I think this was happening on the final shard. First, I removed the final shard and things worked. Then (including all shards) I reordered the list of shards: load_dataset('json', data_files=reordered, streaming=True) and there was no hang.
I won't open an issue yet because I am not quite sure about this observation.
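(A sketch of the reordering workaround, with hypothetical shard paths; any permutation of the list should illustrate the point:)

from datasets import load_dataset

files = [f"data/shard-{i:05d}.jsonl" for i in range(100)]  # hypothetical shard paths
reordered = files[-1:] + files[:-1]  # move the original final shard to the front
ds = load_dataset("json", data_files=reordered, streaming=True)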
@wconnell would you mind opening a different bug issue and giving more details? https://github.com/huggingface/datasets/issues/new?assignees=&labels=&template=bug-report.yml
Thanks.