# [Discussion] Snapshot state for DataPipe

## 🚀 The feature
The state transition proposal referenced in the PR:

- Initial state: `NotStarted`
  - `iter` -> call `reset` and change the state to `Iterating`
  - [Discuss] serialize/deserialize -> keep the state as `NotStarted` (will `reset` if `iter` is called afterwards)
- Initial state: `Iterating`
  - `iter` -> call `reset` and keep the state as `Iterating`
  - [Discuss] serialize/deserialize -> change the state to `Restored`
- Initial state: `Restored`
  - `iter` -> only change the state to `Iterating`
  - serialize/deserialize -> not allowed
There are a few things I want to discuss:

- Does serialization/deserialization using `__getstate__` and `__setstate__` count as a snapshot as shown above?
  - I personally think so, as fast-forwarding would be the backup way to run a snapshot. It would be better to use `__getstate__` to consolidate the state of the DataPipe and restore it without wasting time fast-forwarding.
- And, if so, should we make a metaclass to automatically change the state when `__setstate__` is invoked?
- And, should a DataPipe convert its state from `Iterating` to `NotStarted` after every iteration ends? I know I discussed this with Kevin in the first place, but I completely forgot how we reached an agreement on keeping it as `Iterating` at the end of iteration. Now I realize there is a corner case that this decision would break, which I found while writing test cases:
```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(list(range(1000)))
dl = DataLoader(dp, num_workers=0)
_ = list(dl)  # Running some tests
# After the above line, the DataPipe state becomes `Iterating` since it is single-process.
# Now I create a DataLoader with multiple workers:
dl = DataLoader(dp, num_workers=2)
_ = list(dl)
# The DataPipe state transitions from `Iterating` to `Restored` due to
# multiprocessing, so `reset` is not invoked.
```
Technically speaking, this problem only happens with `DataLoader` but not `DataLoader2`, since DLv2 should always use a copy of the DataPipe graph (even though that's not true for now). It would be great if we had a clear boundary for each state.
cc: @NivekT @VitalyFedyunin
## Motivation, pitch

We need to clarify the state transitions for both users and developers.

## Alternatives

No response

## Additional context

No response
> I personally think so, as fast-forwarding would be the backup way to run a snapshot. It would be better to use `__getstate__` to consolidate the state of the DataPipe and restore it without wasting time fast-forwarding.

Agree, and that has been part of the plan: https://github.com/pytorch/pytorch/pull/79658
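As a rough illustration of consolidating state via `__getstate__` rather than fast-forwarding, consider this toy class (`TinyPipe`, `num_yielded`, and the string-valued `snapshot_state` are all hypothetical, not the real DataPipe API):

```python
import pickle


class TinyPipe:
    """Toy illustration of treating __getstate__/__setstate__ as the
    snapshot mechanism: progress is captured directly, so restoring
    does not need to fast-forward through already-consumed items."""

    def __init__(self, data):
        self.data = list(data)
        self.num_yielded = 0  # progress to snapshot
        self.snapshot_state = "NotStarted"

    def __iter__(self):
        self.snapshot_state = "Iterating"
        while self.num_yielded < len(self.data):
            item = self.data[self.num_yielded]
            self.num_yielded += 1
            yield item

    def __getstate__(self):
        # Consolidate all state needed to resume, instead of replaying items.
        return {"data": self.data, "num_yielded": self.num_yielded}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.snapshot_state = "Restored"  # restoring counts as a snapshot


pipe = TinyPipe(range(5))
it = iter(pipe)
next(it); next(it)                       # consume two items
restored = pickle.loads(pickle.dumps(pipe))
assert restored.snapshot_state == "Restored"
assert list(restored) == [2, 3, 4]       # resumes without fast-forwarding
```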
> And, if so, should we make a metaclass to automatically change the state when `__setstate__` is invoked?

That seems reasonable, given that this proposal also changes the first condition, i.e. `NotStarted` -> `NotStarted`, in the restoration process.
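A metaclass along these lines could wrap any user-defined `__setstate__` so the state flips automatically. This is a hypothetical sketch (`_SnapshotMeta` and `_snapshot_state` are made-up names), not the actual implementation:

```python
import functools
import pickle


class _SnapshotMeta(type):
    """Hypothetical metaclass: wraps a subclass's __setstate__ so the
    snapshot state is set to 'Restored' automatically after deserialization,
    without every DataPipe author having to remember to do it."""

    def __new__(mcs, name, bases, namespace):
        original = namespace.get("__setstate__")
        if original is not None:
            @functools.wraps(original)
            def wrapped(self, state):
                original(self, state)
                self._snapshot_state = "Restored"
            namespace["__setstate__"] = wrapped
        return super().__new__(mcs, name, bases, namespace)


class Pipe(metaclass=_SnapshotMeta):
    def __init__(self, data):
        self.data = data
        self._snapshot_state = "NotStarted"

    def __getstate__(self):
        return {"data": self.data}

    def __setstate__(self, state):
        # Author only restores fields; the metaclass handles the state flip.
        self.__dict__.update(state)


p = pickle.loads(pickle.dumps(Pipe([1, 2, 3])))
assert p._snapshot_state == "Restored"
```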
> convert its state from `Iterating` to `NotStarted` after every iteration ends

This might work for now.
It gets trickier when you are restoring the buffers of container DataPipes with children (e.g. `demux`, `fork`). Sometimes one child may reach `StopIteration` while the other hasn't.

Currently, the snapshot state of those containers isn't changed, but that is fine because we are only using naive fast-forwarding. I believe we will have to check `is_every_instance_exhausted` to determine if a container is exhausted once we start restoring buffers directly. There may be some corner cases related to this, but they are not an immediate concern until I work on full snapshotting.
Overall, I think the proposal is good and it should work.
> It gets trickier when you are restoring the buffers of container DataPipes with children (e.g. `demux`, `fork`). Sometimes one child may reach `StopIteration` while the other hasn't.

Yeah, a reasonable concern. And I feel this is solvable, as the parent is aware of when it reaches the end of iteration even though the children don't have an `__iter__` function.
Closing this for now, as I feel it has been discussed and landed.