Jonathan Shen
(probably same as https://github.com/vllm-project/vllm/issues/16546)
Here is an example with shuffle:
```
import itertools
import datasets
import multiprocessing
import torch.utils.data

def gen(shard):
    worker_info = torch.utils.data.get_worker_info()
    for i in range(10):
        yield {'value': i, 'worker_id': worker_info.id}

def...
```
With `interleave_datasets`:
```
import itertools
import datasets
import multiprocessing
import torch.utils.data

def gen(shard, value):
    while True:
        yield {'value': value}

def main():
    ds = [
        datasets.IterableDataset.from_generator(gen, gen_kwargs={'shard': list(range(8)), 'value': i})
        for...
```
Same results after updating to datasets 3.6.0.
Potentially, but I'm busy. If anyone wants to take this up, please feel free; otherwise I may or may not revisit it when I have free time. For what it's worth...