Jonathan Shen

Results: 85 comments by Jonathan Shen

(probably same as https://github.com/vllm-project/vllm/issues/16546)

Here is an example with shuffle:

```
import itertools
import datasets
import multiprocessing
import torch.utils.data


def gen(shard):
    worker_info = torch.utils.data.get_worker_info()
    for i in range(10):
        yield {'value': i, 'worker_id': worker_info.id}


def...
```

With `interleave_datasets`:

```
import itertools
import datasets
import multiprocessing
import torch.utils.data


def gen(shard, value):
    while True:
        yield {'value': value}


def main():
    ds = [
        datasets.IterableDataset.from_generator(gen, gen_kwargs={'shard': list(range(8)), 'value': i})
        for...
```

Same results after updating to datasets 3.6.0.

Potentially, but busy. If anyone wants to take this up please feel free to, otherwise I may or may not revisit when I have free time. For what it's worth...