[RFC] Restricting `IterDataPipe` to have method `__iter__` as a generator function without method `__next__`
🚀 The feature
Note that this is an RFC solely to discuss the design. There is currently no plan to implement this feature. This issue serves as developer documentation of the current design and of the complexity and issues that we encounter with certain aspects of `IterDataPipe`. It also provides a space to discuss what we can potentially do.
The overarching goal is to simplify certain aspects of `IterDataPipe` while providing flexibility for users.
The proposed feature is to restrict `IterDataPipe` such that it must have a method `__iter__` that is a generator function, and it cannot have the method `__next__`. All built-in `IterDataPipe`s are already implemented that way, so this will only impact custom `IterDataPipe`s that users create.
Alternate solutions are also discussed below. We welcome suggestions as well!
Motivation, pitch
For context, there are currently 3 main types of `IterDataPipe` that are allowed. The ones where:
1. `__iter__` is a generator function (e.g. uses `yield`)
2. `__iter__` returns an iterator but is not a generator function
3. `__iter__` returns `self` and a `__next__` method exists
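To make the three types concrete, here is a minimal sketch using plain Python classes. The class names are hypothetical and the classes do not inherit from `IterDataPipe`, so that the example stays self-contained; only the shape of `__iter__` and `__next__` matters here.

```python
class GeneratorPipe:
    """Type 1: __iter__ is a generator function (it contains yield)."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i


class DelegatingPipe:
    """Type 2: __iter__ returns an iterator but is not a generator function."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))


class SelfIteratingPipe:
    """Type 3: __iter__ returns self and __next__ produces the elements."""

    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        self.i = 0  # state lives on the object, so it must be reset by hand
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        value = self.i
        self.i += 1
        return value
```

All three produce the same elements when iterated, which is why a hook that wraps `__iter__` has to recognize and handle each shape separately.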
Note that it is possible for users to define `__next__` without having `__iter__` return `self`, but that is not recommended and can lead to unexpected behavior. All built-in DataPipes belong to type 1.
The fact that there are 3 types of `IterDataPipe` makes the implementation of `hook_iterator` very complicated.
The hook is called every time `__iter__` of an `IterDataPipe` is invoked. The hook tries to do a few things:
- Enforce the single iterator per `IterDataPipe` constraint (see the operations related to `valid_iterator_id`) and reset the DataPipe as needed
- Count the number of elements yielded
- Allow performance profiling of operations
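The first two responsibilities can be sketched for the simplest case, a generator-style `__iter__` (type 1). This is not the actual `hook_iterator` implementation: the decorator name `simple_hook` and the attributes `_valid_iterator_id` and `_num_yielded` are assumptions made for illustration, and profiling is left out.

```python
import functools


def simple_hook(cls):
    """Hypothetical class decorator that wraps a generator-style __iter__."""
    original_iter = cls.__iter__

    @functools.wraps(original_iter)
    def wrapped_iter(self):
        # Creating a new iterator bumps the ID, which invalidates older iterators.
        self._valid_iterator_id = getattr(self, "_valid_iterator_id", 0) + 1
        my_id = self._valid_iterator_id
        self._num_yielded = 0  # reset the element counter for the new iterator

        def generator():
            for item in original_iter(self):
                if my_id != self._valid_iterator_id:
                    raise RuntimeError("This iterator was invalidated by a newer one.")
                self._num_yielded += 1
                yield item

        return generator()

    cls.__iter__ = wrapped_iter
    return cls


@simple_hook
class Numbers:
    """Toy type 1 pipe used to exercise the hook."""

    def __iter__(self):
        yield from range(3)
```

With this sketch, creating a second iterator over the same `Numbers` instance causes the first iterator to raise `RuntimeError` the next time it is advanced. Handling types 2 and 3 would require wrapping the returned iterator object instead, which is where most of the extra complexity comes from.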
Because there is no restriction on how users can implement `__iter__` and `__next__` for custom DataPipes, `hook_iterator` must be complicated in order to handle the many corner cases that can arise. As the implementation shows, we have a long code block to manage the behavior of type 1, and a custom class to manage the behavior of types 2 and 3. The behavior of the method `__next__` (type 3) is difficult to control and can lead to unexpected behavior if users aren't careful.
If we are able to restrict `IterDataPipe`, the implementation of those functionalities within `hook_iterator` will be much cleaner, at the cost of providing less flexibility for `IterDataPipe`. I believe users will also be less likely to run into errors if we have such a restriction.
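One way such a restriction could be checked at class-definition time is sketched below. The base class `RestrictedIterPipe` is hypothetical and is not part of `torchdata`; the sketch relies only on the standard library's `inspect.isgeneratorfunction` and on `__init_subclass__`.

```python
import inspect


class RestrictedIterPipe:
    """Hypothetical base class that enforces the proposed restriction."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject any subclass that defines __next__ directly.
        if "__next__" in cls.__dict__:
            raise TypeError(f"{cls.__name__} must not define __next__")
        # If the subclass defines __iter__, it must be a generator function.
        iter_fn = cls.__dict__.get("__iter__")
        if iter_fn is not None and not inspect.isgeneratorfunction(iter_fn):
            raise TypeError(f"{cls.__name__}.__iter__ must be a generator function")


class ValidPipe(RestrictedIterPipe):
    """Accepted: __iter__ uses yield and there is no __next__."""

    def __iter__(self):
        yield from range(3)
```

Defining a subclass whose `__iter__` returns `iter(...)` or `self`, or one that defines `__next__`, raises `TypeError` as soon as the class body is executed, so the error surfaces before any data is read.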
Alternatives
Suggestion from @ejguan:
Create a class called `DataPipeIterator`, which contains `__self__` and `__next__`. `__iter__` from a DataPipe always returns a specific `DataPipeIterator` object. This might resolve most of our problems.
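A rough sketch of this alternative follows. It assumes the iterator object follows the standard Python iterator protocol (an `__iter__` that returns itself, plus `__next__`); the constructor arguments, the `num_yielded` attribute, and the `RangePipe` class are hypothetical and only illustrate where the bookkeeping could live.

```python
class DataPipeIterator:
    """Hypothetical iterator object returned by every DataPipe's __iter__."""

    def __init__(self, datapipe, source):
        self._datapipe = datapipe      # the DataPipe that owns this iterator
        self._source = iter(source)    # the underlying element source
        self.num_yielded = 0           # bookkeeping lives on the iterator

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._source)  # StopIteration propagates to the caller
        self.num_yielded += 1
        return item


class RangePipe:
    """Toy DataPipe whose __iter__ always returns a DataPipeIterator."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return DataPipeIterator(self, range(self.n))
```

Because iteration state is held by a separate object rather than by the DataPipe itself, counting and validity checks would only need to be implemented once, in `DataPipeIterator.__next__`.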
Additional context
Such a restriction will likely break some downstream usages. Whatever we do, we will proceed carefully.
Performance impact is another aspect that we must consider.
Feedback and suggestions are more than welcome. Let us know if you have experienced issues while using `torchdata` or have had a bad experience while implementing new features.