Chainer/Concater from single datapipe?
The Concater datapipe takes multiple DPs as input. Is there a class that would take a single datapipe of iterables instead? Something like this:
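    from torch.utils.data import IterDataPipe

    class ConcaterIterable(IterDataPipe):
        def __init__(self, source_datapipe):
            self.source_datapipe = source_datapipe

        def __iter__(self):
            # Each element of the source is itself an iterable; yield its items in order.
            for iterable in self.source_datapipe:
                yield from iterable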
Basically:
itertools.chain == Concater
itertools.chain.from_iterable == ConcaterIterable
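For reference, the difference between the two itertools functions can be shown with plain Python lists. This is only a sketch of the standard-library behaviour and does not involve any DataPipe classes:

```python
from itertools import chain

# chain takes several iterables as separate arguments (the Concater analogue).
separate = list(chain([1, 2], [3], [4, 5]))

# chain.from_iterable takes ONE iterable that yields iterables
# (the behaviour requested for ConcaterIterable).
nested = [[1, 2], [3], [4, 5]]
flattened = list(chain.from_iterable(nested))

# Both produce the same flat sequence.
assert separate == flattened == [1, 2, 3, 4, 5]
```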
Maybe a neat way of implementing this would be to keep a single Concater class, which would fall back to the ConcaterIterable behaviour if it's passed only one DP as input?
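A minimal sketch of that fallback idea, written over plain Python iterables rather than as a real IterDataPipe subclass. The class name ChainOrFlatten is hypothetical and is not part of torchdata:

```python
class ChainOrFlatten:
    """Hypothetical stand-in for illustration; not a torchdata class."""

    def __init__(self, *sources):
        if not sources:
            raise ValueError("expected at least one source")
        self.sources = sources

    def __iter__(self):
        if len(self.sources) == 1:
            # Single input: treat it as an iterable of iterables
            # (itertools.chain.from_iterable behaviour).
            for iterable in self.sources[0]:
                yield from iterable
        else:
            # Several inputs: yield from each one in turn
            # (itertools.chain behaviour).
            for source in self.sources:
                yield from source


# Several inputs are chained; a single nested input is flattened one level.
assert list(ChainOrFlatten([1, 2], [3])) == [1, 2, 3]
assert list(ChainOrFlatten([[1, 2], [3]])) == [1, 2, 3]
```

One caveat with this design: the single-input call changes meaning, so ChainOrFlatten(dp) is no longer equivalent to iterating dp itself. That may surprise callers that build the argument list dynamically.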
Details: I need this for my benchmarking on manifold where each file is a big pickle archive of multiple images. My DP builder looks like this:
    def make_manifold_dp(root, dataset_size):
        handler = ManifoldPathHandler()
        dp = IoPathFileLister(root=root)
        dp.register_handler(handler)
        dp = dp.shuffle(buffer_size=dataset_size).sharding_filter()
        dp = IoPathFileOpener(dp, mode="rb")
        dp.register_handler(handler)
        dp = PickleLoaderDataPipe(dp)
        dp = ConcaterIterable(dp)  # <-- Needed here!
        return dp
BTW, this is a NIT, but has it been considered to rename Concater into Chainer to be a bit more consistent with itertools?
You can try .unbatch() for this; it is not as generic, but it might work in your case.
However, the proper solution would be to add a new DataPipe, and I would rather call it flatten.
Note: we already use flatten for horizontal/column operations, so perhaps we need rows_flatten or some other (much) better name.
> You can try to use .unbatch() for it, it is not so generic but might work in your case. However proper solution would be to add new DataPipe. And I would rather call it flatten
Looks like unbatch() works for a datapipe that contains lists, but it doesn't work for datapipes that contain datapipes, so in https://github.com/pytorch/data/issues/732 I still had to resort to something like ConcaterIterable above.
We will work to introduce the function for your case.
Edit: I think the new operation flatten in this open PR should be able to handle an IterDataPipe of iterables, depending on how we end up implementing that.
Looks like our final solution would be to allow flatmap to be used as a no-op.
Meanwhile, you can use:

    dp = dp.flatmap(fn=lambda x: x)
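To see why an identity function is enough here, the plain-Python sketch below mimics what a flatmap step does: apply fn to each element, then yield the items of the result one by one. Note that flatmap_sketch is a hypothetical helper written for illustration, not the torchdata implementation, and the archive contents are made-up placeholder strings:

```python
from itertools import chain


def flatmap_sketch(source, fn):
    """Hypothetical helper: apply fn to each element, then flatten one level."""
    for element in source:
        yield from fn(element)


# Each inner list stands in for one unpickled archive holding several images.
archives = [["img0", "img1"], ["img2"], ["img3", "img4"]]

# With the identity function, the flatmap step reduces to chain.from_iterable.
flat = list(flatmap_sketch(archives, lambda x: x))
assert flat == list(chain.from_iterable(archives))
```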
This can be closed.
Closing this. Feel free to re-open if necessary.