[RFC] Restricting `IterDataPipe` to have method `__iter__` as a generator function without method `__next__`

Open NivekT opened this issue 2 years ago • 0 comments

🚀 The feature

Note that this is an RFC solely to discuss the design. There is currently no plan to implement this feature. This issue serves as developer documentation of the current design and of the complexity and issues that we encounter with certain aspects of IterDataPipe. It also provides a space to discuss what we can potentially do.

The overarching goal is to simplify certain aspects of IterDataPipe while providing flexibility for users.

The proposed feature is to restrict IterDataPipe so that it must have a method __iter__ that is a generator function and cannot have the method __next__. All built-in IterDataPipes are already implemented that way, so this will only impact custom IterDataPipes that users create.

Alternate solutions are also discussed below. We welcome suggestions as well!

Motivation, pitch

For context, there are currently 3 main types of IterDataPipe that are allowed:

1. __iter__ is a generator function (e.g. it uses yield)
2. __iter__ returns an iterator but is not a generator function
3. __iter__ returns self and a __next__ method exists

Note that it is possible for users to define __next__ without having __iter__ return self, but that is not recommended and can lead to unexpected behavior. All built-in DataPipes belong to type 1.
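To make the three types concrete, here is a minimal sketch. It uses plain Python classes as stand-ins (an assumption made so the example runs without torch installed); a real DataPipe would subclass IterDataPipe, and the class names below are purely illustrative.

```python
# Plain-Python stand-ins for the three styles. A real DataPipe would subclass
# IterDataPipe; the class names here are hypothetical.

class GeneratorStyle:
    """Type 1: __iter__ is a generator function."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i


class ReturnsIteratorStyle:
    """Type 2: __iter__ returns a separate iterator without using yield."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))


class SelfIteratorStyle:
    """Type 3: __iter__ returns self, and __next__ produces the elements."""

    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        self.i = 0  # reset state so the object can be iterated again
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        value = self.i
        self.i += 1
        return value
```

All three produce the same elements when consumed with a for loop, but only type 3 keeps the iteration state on the object itself, which is part of why it is harder to control.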

The fact that there are 3 types of IterDataPipe makes the implementation of hook_iterator very complicated.

The hook is called every time __iter__ of an IterDataPipe is invoked. The hook tries to do a few things:

Enforce the single iterator per IterDataPipe constraint (see the operations related to valid_iterator_id) and reset the DataPipe as needed
Count the number of elements yielded
Allow performance profiling of operations
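The three responsibilities above can be sketched for the generator-only case (type 1). This is not the actual hook_iterator implementation; it is a simplified illustration, and the decorator name and the attributes _valid_iterator_id, _yielded and _time_spent are assumptions chosen for the example.

```python
import functools
import time


def hook_generator_iter(cls):
    """Hypothetical, simplified hook for a class whose __iter__ is a generator function."""
    original_iter = cls.__iter__

    @functools.wraps(original_iter)
    def wrapped_iter(self):
        # Runs eagerly on every iter() call: a new iterator invalidates older ones.
        self._valid_iterator_id = getattr(self, "_valid_iterator_id", 0) + 1
        my_id = self._valid_iterator_id
        self._yielded = 0       # elements yielded by the current iterator
        self._time_spent = 0.0  # seconds spent inside the wrapped generator
        gen = original_iter(self)

        def guarded():
            while True:
                if my_id != self._valid_iterator_id:
                    raise RuntimeError("This iterator was invalidated by a newer one.")
                start = time.perf_counter()
                try:
                    item = next(gen)
                except StopIteration:
                    return
                finally:
                    # Profiling: accumulate time spent producing each element.
                    self._time_spent += time.perf_counter() - start
                self._yielded += 1
                yield item

        return guarded()

    cls.__iter__ = wrapped_iter
    return cls


@hook_generator_iter
class Numbers:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i
```

With only one style to support, the wrapper stays short. Covering types 2 and 3 as well would require intercepting __next__ on arbitrary user objects, which is where most of the corner cases come from.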

Because there is no restriction on how users can implement __iter__ and __next__ for custom DataPipes, hook_iterator must be complicated in order to handle the many corner cases that can arise. We have a long code block to manage the behavior of type 1, and a custom class to manage the behavior of types 2 and 3. The behavior of the method __next__ (type 3) is difficult to control and can lead to unexpected behavior if users aren't careful.

If we are able to restrict IterDataPipe, the implementation of those functionalities within hook_iterator will be much cleaner, at the cost of providing less flexibility for IterDataPipe. I believe users will also be less likely to run into errors if we have such a restriction.
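One way such a restriction could be checked at class-definition time is sketched below. This is only an illustration of the idea, not a proposed implementation: RestrictedIterable is a hypothetical stand-in for IterDataPipe, and the use of __init_subclass__ is an assumption about where the check might live.

```python
import inspect


class RestrictedIterable:
    """Hypothetical base class that applies the proposed restriction to subclasses."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Rule 1: subclasses may not define __next__.
        if "__next__" in cls.__dict__:
            raise TypeError(f"{cls.__name__} must not define __next__")
        # Rule 2: if a subclass defines __iter__, it must be a generator function.
        iter_fn = cls.__dict__.get("__iter__")
        if iter_fn is not None and not inspect.isgeneratorfunction(iter_fn):
            raise TypeError(f"{cls.__name__}.__iter__ must be a generator function")


class Evens(RestrictedIterable):
    """Passes both checks: __iter__ uses yield and there is no __next__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(0, self.n, 2):
            yield i
```

A class of type 2 or type 3 would fail at definition time with a TypeError, which surfaces the problem earlier than a runtime error during iteration, but it is also exactly the kind of change that could break existing custom DataPipes.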

Alternatives

Suggestion from @ejguan: Create a class called DataPipeIterator, which contains __self__ and __next__. __iter__ from DataPipe always returns a specific DataPipeIterator object. This might resolve most of our problems.
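A rough sketch of how this alternative might look follows. It is one interpretation of the suggestion, not an agreed design: it assumes the iterator object returns itself from __iter__ and implements __next__, and the _produce method and the yielded counter are hypothetical names introduced for illustration.

```python
class DataPipeIterator:
    """Hypothetical iterator object that every DataPipe's __iter__ would return."""

    def __init__(self, source_iter):
        self._source_iter = source_iter
        self.yielded = 0  # bookkeeping lives on the iterator instead of in a hook

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the source propagates and ends the iteration.
        item = next(self._source_iter)
        self.yielded += 1
        return item


class Squares:
    """Stand-in DataPipe; user logic lives in a hypothetical _produce method."""

    def __init__(self, n):
        self.n = n

    def _produce(self):
        for i in range(self.n):
            yield i * i

    def __iter__(self):
        # Always hand back the same kind of iterator object.
        return DataPipeIterator(self._produce())
```

Because every DataPipe would return the same iterator type, counting, validity checks and profiling could be implemented once in DataPipeIterator.__next__ rather than by patching each user-defined __iter__.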

Additional context

Such a restriction will likely break some downstream usage. Whatever we do, we will proceed carefully.

Performance impact is another aspect that we must consider.

Feedback and suggestions are more than welcome. Let us know if you have experienced issues while using torchdata or have had a bad experience while implementing new features.

Jun 30 '22 20:06 NivekT
