# [Discussion] Snapshot state for DataPipe

## 🚀 The feature
The state transition proposal referenced in the PR:

- Initial state: `NotStarted`
  - `iter` -> call `reset` and change the state to `Iterating`
  - [Discuss] serialize/deserialize -> keep the state as `NotStarted` (will `reset` if `iter` is called afterwards)
- Initial state: `Iterating`
  - `iter` -> call `reset` and keep the state as `Iterating`
  - [Discuss] serialize/deserialize -> change the state to `Restored`
- Initial state: `Restored`
  - `iter` -> only change the state to `Iterating`
  - serialize/deserialize -> not allowed
There are a few things I want to discuss:

- Does serialization/deserialization using `__getstate__` and `__setstate__` count as a snapshot as shown above?
  - I personally think so, as fast-forwarding would be the backup way to run a snapshot. It would be better to use `__getstate__` to consolidate the state of the DataPipe and restore it without wasting time fast-forwarding.
- And, if so, should we make a metaclass to automatically change the state when `__setstate__` is invoked?
- And, should a DataPipe convert its state from `Iterating` to `NotStarted` after every iteration ends? I know I discussed this with Kevin in the first place, but I completely forgot how we reached an agreement on keeping it as `Iterating` at the end of iteration. Now I realize there is a corner case that this decision would break, which I found while writing test cases:
```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(list(range(1000)))
dl = DataLoader(dp, num_workers=0)
_ = list(dl)  # Running some tests
# After the above line, the DataPipe state becomes `Iterating` since it is single-process.
# Now I create a DataLoader with multiple workers:
dl = DataLoader(dp, num_workers=2)
_ = list(dl)
# The DataPipe state transitions from `Iterating` to `Restored` due to
# multiprocessing, so `reset` is not invoked.
```
Technically speaking, this problem only happens with `DataLoader` but not `DataLoader2`, since DLv2 should always use a copy of the DataPipe graph (even though that's not true for now). It would be great if we had a clear boundary for each state.
cc: @NivekT @VitalyFedyunin
## Motivation, pitch

We need to clarify the state transitions for both users and developers.

## Alternatives

No response

## Additional context

No response
> I personally think so, as fast-forwarding would be the backup way to run a snapshot. It would be better to use `__getstate__` to consolidate the state of the DataPipe and restore it without wasting time fast-forwarding.

Agree, and that has been part of the plan: https://github.com/pytorch/pytorch/pull/79658
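As a rough illustration of consolidating state via `__getstate__` rather than fast-forwarding, consider this toy class (`TinyPipe`, `num_yielded`, and the string-valued `snapshot_state` are all hypothetical, not the real DataPipe API):

```python
import pickle


class TinyPipe:
    """Toy illustration of treating __getstate__/__setstate__ as the
    snapshot mechanism: progress is captured directly, so restoring
    does not need to fast-forward through already-consumed items."""

    def __init__(self, data):
        self.data = list(data)
        self.num_yielded = 0  # progress to snapshot
        self.snapshot_state = "NotStarted"

    def __iter__(self):
        self.snapshot_state = "Iterating"
        while self.num_yielded < len(self.data):
            item = self.data[self.num_yielded]
            self.num_yielded += 1
            yield item

    def __getstate__(self):
        # Consolidate all state needed to resume, instead of replaying items.
        return {"data": self.data, "num_yielded": self.num_yielded}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.snapshot_state = "Restored"  # restoring counts as a snapshot


pipe = TinyPipe(range(5))
it = iter(pipe)
next(it); next(it)                       # consume two items
restored = pickle.loads(pickle.dumps(pipe))
assert restored.snapshot_state == "Restored"
assert list(restored) == [2, 3, 4]       # resumes without fast-forwarding
```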
> And, if so, should we make a metaclass to automatically change the state when `__setstate__` is invoked?

That seems reasonable, given that this proposal also changes the first condition, i.e. `NotStarted` -> `NotStarted`, in the restoration process.
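A metaclass along these lines could wrap any user-defined `__setstate__` so the state flips automatically. This is a hypothetical sketch (`_SnapshotMeta` and `_snapshot_state` are made-up names), not the actual implementation:

```python
import functools
import pickle


class _SnapshotMeta(type):
    """Hypothetical metaclass: wraps a subclass's __setstate__ so the
    snapshot state is set to 'Restored' automatically after deserialization,
    without every DataPipe author having to remember to do it."""

    def __new__(mcs, name, bases, namespace):
        original = namespace.get("__setstate__")
        if original is not None:
            @functools.wraps(original)
            def wrapped(self, state):
                original(self, state)
                self._snapshot_state = "Restored"
            namespace["__setstate__"] = wrapped
        return super().__new__(mcs, name, bases, namespace)


class Pipe(metaclass=_SnapshotMeta):
    def __init__(self, data):
        self.data = data
        self._snapshot_state = "NotStarted"

    def __getstate__(self):
        return {"data": self.data}

    def __setstate__(self, state):
        # Author only restores fields; the metaclass handles the state flip.
        self.__dict__.update(state)


p = pickle.loads(pickle.dumps(Pipe([1, 2, 3])))
assert p._snapshot_state == "Restored"
```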
> convert its state from `Iterating` to `NotStarted` after every iteration ends

This might work for now.
It gets trickier when you are restoring the buffers of container DataPipes with children (e.g. `demux`, `fork`). Sometimes one child may reach `StopIteration` while the other hasn't.

Currently, the snapshot state of those containers isn't changed, but that is fine because we are only using naive fast-forwarding. I believe we will have to check `is_every_instance_exhausted` to determine if a container is exhausted once we start restoring buffers directly. There may be some corner cases related to this, but they are not an immediate concern until I work on full snapshotting.
Overall, I think the proposal is good and it should work.
> It gets trickier when you are restoring the buffers of container DataPipes with children (e.g. `demux`, `fork`). Sometimes one child may reach `StopIteration` while the other hasn't.

Yeah, a reasonable concern. And I feel this is solvable, as the parent is aware of when it reaches the end of iteration even though the children don't have an `__iter__` function.
Closing this for now, as I feel it has been discussed and landed.