Sebastian Hoffmann
Notice that the above example is just for demonstration purposes. In a real pipeline these two sharding operations might take place in vastly different places. So replacing them with one...
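For illustration, a minimal sketch of what I mean (file names and intermediate steps are hypothetical, not taken from the example above):

```
# Two sharding operations living in very different parts of one pipeline:
# sharding across distributed ranks right after listing files, and sharding
# across dataloader workers much later in the graph.
from torchdata.datapipes.iter import IterableWrapper
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES

files = IterableWrapper([f"shard_{i}.tar" for i in range(16)])  # hypothetical shards

# Shard across distributed ranks as early as possible ...
dp = files.shuffle().sharding_filter(sharding_group_filter=SHARDING_PRIORITIES.DISTRIBUTED)

# ... many intermediate steps (opening, decoding, transforms, ...) would sit here ...

# ... and shard across multiprocessing workers only at the very end.
dp = dp.sharding_filter(sharding_group_filter=SHARDING_PRIORITIES.MULTIPROCESSING)
```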
@ejguan `set_graph_random_seed` does not account for different sharding priorities either (https://github.com/pytorch/data/blob/main/torchdata/dataloader2/graph/settings.py#L31)
I find the intended behavior a bit problematic anyway: the usual principle with respect to multiprocessing right now is that every worker executes the same pipeline. If a sharding filter is encountered,...
It is also not clear to me right now how such shuffle operations are supposed to behave if one wants to set a fixed seed via `DataLoader2.seed()`.
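Concretely, what I have in mind is something like this (a minimal sketch with made-up numbers, assuming a `MultiProcessingReadingService` with two workers):

```
# Shuffle before the sharding filter, fixed seed set on the DataLoader2:
# how is the shuffle supposed to behave across the two workers here?
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(100)).shuffle().sharding_filter()

rs = MultiProcessingReadingService(num_workers=2)
dl = DataLoader2(dp, reading_service=rs)
dl.seed(42)  # fixed seed for reproducibility

first_epoch = list(dl)
```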
@ejguan I believe this has been fixed by https://github.com/pytorch/pytorch/pull/97287. Is that correct?
No, sorry, I'm afraid not. A fix could look like this: https://github.com/ejguan/pytorch/blob/f2cea87c1f9741e78c60c456bb0cd0f22d0689f7/torch/utils/data/graph_settings.py#L65

```
if len(sig.parameters) < 3:
    sharded = dp.apply_sharding(num_of_instances, instance_id)
else:
    sharded = dp.apply_sharding(num_of_instances, instance_id, sharding_group=sharding_group)
if sharded:
    applied...
```
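The idea of that snippet as a self-contained sketch (the helper name `_apply_sharding_compat` is mine, not torchdata/PyTorch API): only forward `sharding_group` when the datapipe's `apply_sharding` actually accepts it, so legacy two-argument implementations keep working.

```
import inspect


def _apply_sharding_compat(dp, num_of_instances, instance_id, sharding_group):
    # Hypothetical helper illustrating the fix above: inspect the signature of
    # the datapipe's apply_sharding and only forward sharding_group if the
    # implementation knows about sharding groups/priorities.
    sig = inspect.signature(dp.apply_sharding)
    if len(sig.parameters) < 3:
        # Legacy signature: apply_sharding(num_of_instances, instance_id)
        dp.apply_sharding(num_of_instances, instance_id)
    else:
        # Newer signature that is aware of sharding groups
        dp.apply_sharding(num_of_instances, instance_id, sharding_group=sharding_group)
    return dp
```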
> I am sorry that I think we currently don't support two `ShardingRoundRobinDispatcher`

This should potentially be taken into consideration as a use case with regard to https://github.com/pytorch/data/issues/1174
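To make the use case concrete, a minimal sketch (the pipeline and its steps are made up) of a graph with two `ShardingRoundRobinDispatcher` stages:

```
# Two round-robin dispatch points in one graph, e.g. one in front of an
# expensive decode step and another in front of an expensive transform.
from torchdata.datapipes.iter import IterableWrapper
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES

dp = IterableWrapper(range(1000)).shuffle()

# First non-replicable section: dispatch items round-robin to the workers.
dp = dp.sharding_round_robin_dispatch(sharding_group_filter=SHARDING_PRIORITIES.MULTIPROCESSING)
dp = dp.map(lambda x: x * 2)  # stand-in for an expensive step

# Second dispatcher later in the same graph -- according to the quoted
# comment, this is currently not supported.
dp = dp.sharding_round_robin_dispatch(sharding_group_filter=SHARDING_PRIORITIES.MULTIPROCESSING)
```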
Hey, thanks for the update. Does that mean that torchdata will become obsolete in the future? As I already indicated in older issues, what I see as the biggest weakness...
Hey @andrewkho, this is very cool! The design makes a lot of sense to me, and the focus on modularity and fine-grained control over parallelism and sharding is very appreciated!...