
[RFC] More support for functionalities from `itertools`

Open NivekT opened this issue 3 years ago • 5 comments

🚀 The feature

Over time, we have received more and more requests for additional IterDataPipes (e.g. #648, #754, plus many more). Sometimes, these functionalities are very similar to what is already implemented in itertools and more-itertools.

Adding more IterDataPipes one at a time seems unsustainable(?). Perhaps we should draw a line somewhere, or provide a better interface for users to directly use functions from itertools. At the same time, providing APIs with names that are already familiar to Python users can improve the user experience. As @msaroufim mentioned, the Core library does aim to match operators with what is available in numpy.

We will need to decide on:

  1. Coverage - which set of functionalities should we officially support in torchdata?
  2. Implementation - how will users be able to invoke those functions?

Coverage

  0. Arbitrary, based on estimated user requests/contributions
  1. itertools: ~20 functions (some of which already exist in torchdata)
    • This seems common enough and reasonable?
  2. more-itertools: ~100 functions?
    • This is probably too much.

If we provide a good wrapper, we might not need to worry about the actual coverage too much?

Implementation

  1. Keep adding each function as a new IterDataPipe
    • This is what we have been doing. We can keep doing that but the cost of maintenance will increase over time.

Currently, you can use IterableWrapper, but it doesn't always work well: it accepts an iterable, and an iterable is not guaranteed to restart if you call iter() on it again.

from torchdata.datapipes.iter import IterableWrapper
from itertools import accumulate

source_dp = IterableWrapper(range(10))
dp3 = IterableWrapper(accumulate(source_dp), deepcopy=False)
list(dp3)  # [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
list(dp3)  # []

A couple of ideas to work around that:

  1. Provide a different wrapper that accepts a Callable that returns an Iterable, which will be iterated over

    • Users can use functools.partial to pass in arguments (including DataPipes if desired)
    • I personally think we should do this since the cost of doing so is low and unlocks other possibilities.
  2. Create an Itertools DataPipe that delegates to other DataPipes; it might look something like this:

from typing import Callable, Dict

class ItertoolsIterDataPipe(IterDataPipe):

    supported_operations: Dict[str, Callable] = {
        "repeat": Repeater,
        "chain": Concater,
        "filterfalse": filter_false_constructor,
        # most/all ~20 `itertools` functions here?
    }

    def __new__(cls, name, *args, **kwargs):
        if name not in cls.supported_operations:
            raise RuntimeError(f"Operation {name!r} is not supported")
        constructor = cls.supported_operations[name]
        return constructor(*args, **kwargs)

source_dp = IterableWrapper(range(10))
dp1 = source_dp.filter(lambda x: x >= 5)
dp2 = ItertoolsIterDataPipe("filterfalse", source_dp, lambda x: x >= 5)

list(dp1)  # [5, 6, 7, 8, 9]
list(dp2)  # [0, 1, 2, 3, 4]
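For comparison, the first idea above (a wrapper around a Callable that returns a fresh Iterable) can be sketched in plain Python. The class name `IterableFactoryWrapper` is hypothetical, not an existing torchdata API; a real version would subclass `IterDataPipe`:

```python
# Hypothetical sketch of idea 1: wrap a zero-argument Callable that
# returns a fresh Iterable, so every iter() call restarts cleanly.
from functools import partial
from itertools import accumulate


class IterableFactoryWrapper:
    def __init__(self, factory):
        self.factory = factory  # zero-arg callable returning a new Iterable

    def __iter__(self):
        # A fresh iterable is created on each pass, so re-iteration works.
        return iter(self.factory())


dp = IterableFactoryWrapper(partial(accumulate, range(10)))
list(dp)  # [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
list(dp)  # same output again, unlike wrapping an exhausted iterator
```

Because the wrapper stores the factory rather than an iterator, re-iteration never hits the exhausted-generator problem shown earlier.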

These options are incomplete. If you have more ideas, please comment below.

Motivation, pitch

These functionalities are commonly used and can be valuable for users.

Additional context

Credit to @NicolasHug @msaroufim @pmeier and many others for past feedback and discussion related to this topic.

cc: @VitalyFedyunin @ejguan

NivekT avatar Aug 30 '22 21:08 NivekT

Thanks for opening this issue @NivekT !

Regarding coverage I think this is the kind of thing you want to figure out organically, mostly driven by user requests (i.e. option 0 above). If a user comes with a need, hopefully there will already be a nice canonical way to compose a few datapipes to achieve what they want to do. If it's not super easy to invent or discover, it might be worth documenting. If it's too complex / easy to get wrong, or if it just doesn't exist yet and is in scope, then it's worth considering implementing it as a torchdata built-in.

At the same time, providing APIs with names that are already familiar to Python users can improve the user experience

Also echoing @msaroufim https://github.com/pytorch/data/issues/754#issuecomment-1232041584 , naming consistency with the builtin itertools and possibly other similar packages in the ecosystem is important. It reduces transition friction, increases chances of adoption, and makes all APIs more discoverable.

NicolasHug avatar Aug 31 '22 09:08 NicolasHug

hopefully there will already be a nice canonical way to compose a few datapipes to achieve what they want to do

#757 might allow users to more easily use existing functions from other libraries to compose an IterDataPipe that they would like.

naming consistency with the builtin itertools and possibly other similar packages in the ecosystem is important.

Agreed, and our team should have a look soon.

NivekT avatar Sep 01 '22 22:09 NivekT

I generally have concerns about snapshotting with itertools, as it heavily uses generator functions that won't be serializable. What's your opinion, since you might have better knowledge of snapshotting?

From an API perspective, since we started torchdata we have tried our best to mimic the itertools API to reduce friction, and we should continue to maintain it this way.

ejguan avatar Sep 06 '22 14:09 ejguan

That's a valid concern. I believe if the function is stateless (i.e. the next output only depends on the next input) then it is fine.

Otherwise, it will have to fall back on naive snapshot restoration (i.e. counting the number of elements yielded).
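The counting fallback can be sketched as follows. The function names `snapshot` and `restore` are illustrative only, not torchdata APIs:

```python
# Hypothetical sketch of naive snapshot restoration: record how many
# elements were yielded, then fast-forward a fresh iterator by that
# count on restore.
from itertools import islice


def snapshot(n_yielded):
    # The "snapshot" is just the count of elements already produced.
    return {"n_yielded": n_yielded}


def restore(make_iterable, state):
    # Re-create the iterable and skip the elements already consumed,
    # using the standard islice "consume" recipe.
    it = iter(make_iterable())
    n = state["n_yielded"]
    next(islice(it, n, n), None)
    return it


it = iter(range(10))
consumed = [next(it) for _ in range(4)]   # [0, 1, 2, 3]
state = snapshot(len(consumed))

restored = restore(lambda: range(10), state)
list(restored)  # [4, 5, 6, 7, 8, 9]
```

Note this fallback re-executes the skipped work on restore, which is why a cheaper state-based restoration is preferable for stateless functions.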

NivekT avatar Sep 07 '22 23:09 NivekT

I don't think the proposal was ever to wrap itertools inside the pipes, but rather to mirror the functionality and naming. So if the function is indeed stateless, we can use itertools, but otherwise nothing stops us from implementing the functionality ourselves. cc @NicolasHug

pmeier avatar Sep 08 '22 06:09 pmeier