Chainer/Concater from single datapipe?

Open NicolasHug opened this issue 2 years ago • 7 comments

The Concater datapipe takes multiple DPs as input. Is there a class that would take a single datapipe of iterables instead? Something like this:

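from torch.utils.data import IterDataPipe


class ConcaterIterable(IterDataPipe):
    """Flatten a single datapipe whose elements are themselves iterables."""

    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for iterable in self.source_datapipe:
            yield from iterable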

Basically:

itertools.chain == Concater
itertools.chain.from_iterable == ConcaterIterable
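
For reference, the standard-library behaviour that the comparison above relies on can be checked directly. This is a small sketch using only itertools; the sample lists are made up for illustration.

```python
from itertools import chain

a = [1, 2]
b = [3, 4]

# chain takes each iterable as a separate argument (the Concater analogue).
chained = list(chain(a, b))

# chain.from_iterable takes ONE iterable that yields iterables
# (the behaviour the proposed ConcaterIterable would have).
flattened = list(chain.from_iterable([a, b]))

print(chained)    # [1, 2, 3, 4]
print(flattened)  # [1, 2, 3, 4]
```

Both calls produce the same flat sequence; the only difference is whether the inputs arrive as separate arguments or wrapped in a single outer iterable.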

Maybe a neat way of implementing this would be to keep a single Concater class, which would fall back to the ConcaterIterable behaviour if it's passed only one DP as input?
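
A rough sketch of what such a fallback could look like, written in plain Python rather than as a real DataPipe. The `Chainer` class below is hypothetical and exists only to illustrate the idea; it is not the torchdata Concater implementation.

```python
from itertools import chain


class Chainer:
    """Hypothetical stand-in for a Concater with a single-input fallback."""

    def __init__(self, *sources):
        if not sources:
            raise ValueError("expected at least one source")
        self.sources = sources

    def __iter__(self):
        if len(self.sources) == 1:
            # Single input: treat it as an iterable of iterables.
            yield from chain.from_iterable(self.sources[0])
        else:
            # Several inputs: concatenate them in order.
            yield from chain(*self.sources)


multi = list(Chainer([1, 2], [3]))     # several inputs, concatenated
single = list(Chainer([[1, 2], [3]]))  # one input of iterables, flattened
```

One caveat with this design: the meaning of a one-argument call depends on the argument count alone. If the existing class accepts a single input and passes its elements through unchanged, this fallback would change that behaviour, and `Chainer([1, 2])` in the sketch fails on iteration because the integers are not iterable.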


Details: I need this for my benchmarking on manifold where each file is a big pickle archive of multiple images. My DP builder looks like this:

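from torchdata.datapipes.iter import IoPathFileLister, IoPathFileOpener

# ManifoldPathHandler and PickleLoaderDataPipe are defined elsewhere (not shown).


def make_manifold_dp(root, dataset_size):
    handler = ManifoldPathHandler()
    dp = IoPathFileLister(root=root)
    dp.register_handler(handler)

    dp = dp.shuffle(buffer_size=dataset_size).sharding_filter()

    dp = IoPathFileOpener(dp, mode="rb")
    dp.register_handler(handler)

    # Each pickle archive holds multiple images, so flatten them into one stream.
    dp = PickleLoaderDataPipe(dp)
    dp = ConcaterIterable(dp)  # <-- Needed here!
    return dp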

NicolasHug, Jul 13 '22 14:07

BTW, this is a nit, but has renaming Concater to Chainer been considered, to be a bit more consistent with itertools?

NicolasHug, Jul 13 '22 14:07

You can try .unbatch() for this; it is not as generic, but it might work in your case.

However, the proper solution would be to add a new DataPipe, and I would rather call it flatten.

VitalyFedyunin, Jul 26 '22 15:07

Note: we already use flatten for horizontal/column operations. Perhaps we need rows_flatten or another (much better) name.

VitalyFedyunin, Aug 15 '22 16:08

> You can try .unbatch() for this; it is not as generic, but it might work in your case.
>
> However, the proper solution would be to add a new DataPipe, and I would rather call it flatten.

It looks like unbatch() works for a datapipe that contains lists, but not for datapipes that contain datapipes. So in https://github.com/pytorch/data/issues/732 I still had to resort to something like the ConcaterIterable above.

NicolasHug, Aug 15 '22 17:08

We will work on introducing a function for your case.

VitalyFedyunin, Aug 15 '22 17:08

Edit: I think the new flatten operation in this open PR should be able to handle an IterDataPipe of iterables, depending on how we end up implementing it.

NivekT, Aug 15 '22 18:08

It looks like our final solution will be to allow flatmap to act as a no-op, that is, to flatten without applying a function. In the meantime, you can use:

dp = dp.flatmap(fn=lambda x: x)

VitalyFedyunin, Aug 16 '22 21:08
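
The identity flatmap workaround can be pictured with a plain-Python stand-in. The `flatmap` helper below is hypothetical and only mimics the general idea (apply a function to each element, then yield from the result); it is not torchdata's implementation. Its `fn=None` default is an assumption about how the proposed no-op form might look, and the sample data is made up.

```python
def flatmap(source, fn=None):
    """Hypothetical stand-in: map each element, then flatten one level."""
    for element in source:
        # With no function given, the element itself is flattened (no-op map).
        result = element if fn is None else fn(element)
        yield from result


# Each inner list plays the role of one unpickled archive of images.
archives = [["img0", "img1"], ["img2"]]

with_identity = list(flatmap(archives, fn=lambda x: x))
with_noop = list(flatmap(archives))
```

With an identity function, the mapping step changes nothing and only the flattening remains, which is the same result ConcaterIterable produces.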

This can be closed.

SvenDS9, Mar 10 '23 17:03

Closing this. Feel free to re-open if necessary.

NivekT, Mar 14 '23 20:03