Add Examples of Common Preprocessing Steps with IterDataPipe (such as splitting a data set into two)
📚 The doc issue
There are a few common steps that users often want to perform while preprocessing data, such as splitting their dataset into train and eval. PyTorch Core's documentation covers how to do these things with Dataset. We should add the same to our documentation, specifically for IterDataPipe, or link to PyTorch Core's documentation for reference where that is appropriate. This issue is driven by common questions we have received either in person or on the forum.
If we find that any functionality is missing for IterDataPipe, we should implement it.
Is there a general method? I currently implement an IterDataPipe that splits the dataset by index:
@functional_datapipe("index_split")
class IndexSpliterIterDataPipe(IterDataPipe):
def __init__(self, source_dp, start_idx, end_idx) -> None:
super().__init__()
self.source_dp = source_dp
self.start_idx = start_idx
self.end_idx = end_idx
assert self.end_idx > self.start_idx
def __iter__(self):
source_data = copy.deepcopy(self.source_dp)
source_data = iter(source_data)
for _ in range(self.start_idx):
next(source_data)
for _ in range(self.start_idx, self.end_idx):
yield next(source_data)
def __len__(self):
return self.end_idx - self.start_idx
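For illustration, here is a hypothetical usage of the functional form registered above; the IterableWrapper source and the 8,000/2,000 boundary are made up for this sketch:

from torchdata.datapipes.iter import IterableWrapper

# Illustrative source pipe and split boundary (not from the original post).
source_dp = IterableWrapper(range(10_000))

train_dp = source_dp.index_split(0, 8_000)        # samples 0..7999
eval_dp = source_dp.index_split(8_000, 10_000)    # samples 8000..9999

assert len(train_dp) == 8_000 and len(eval_dp) == 2_000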
If the total length cannot be obtained (i.e., end_idx is unknown), the following IterDataPipe can be used:
# Alternative variant with an optional end_idx. It shares the functional name
# "index_split" with the version above, so register only one of the two.
import copy

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe


@functional_datapipe("index_split")
class IndexSpliterIterDataPipe(IterDataPipe):
    def __init__(self, source_dp, start_idx=0, end_idx=-1) -> None:
        super().__init__()
        self.source_dp = source_dp
        self.start_idx = start_idx
        self.end_idx = end_idx
        assert self.end_idx == -1 or self.end_idx > self.start_idx

    def __iter__(self):
        source_data = copy.deepcopy(self.source_dp)
        source_data = iter(source_data)
        for _ in range(self.start_idx):
            next(source_data)
        if self.end_idx == -1:
            # No upper bound: yield everything after start_idx.
            for d in source_data:
                yield d
        else:
            for _ in range(self.start_idx, self.end_idx):
                yield next(source_data)
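For example (again with a hypothetical IterableWrapper source), this variant can take everything from a start index onward:

from torchdata.datapipes.iter import IterableWrapper

source_dp = IterableWrapper(range(10_000))          # illustrative source
tail_dp = source_dp.index_split(start_idx=8_000)    # everything from index 8000 on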
@ezeli I think having a custom DataPipe seems fine for your use case. I am open to adding that to the library if more users have a use case for it.
We currently have .header(limit=10), which can yield from the start up to the specified limit, but not the exact range that you'd like.
If you prefer to use built-in DataPipes instead, you can use one of the following (see the sketch after this list):
- dp = dp.enumerate().filter(filter_fn) - if you want to discard every sample outside of the index range
- dp1, dp2 = dp.enumerate().demux(classifier_fn) - if you want to split the samples into two DataPipes
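A minimal sketch of the two built-in approaches, assuming an IterableWrapper source and an 8,000-sample cut-off (both made up for illustration). Note that demux also takes num_instances and holds elements in a bounded internal buffer (buffer_size):

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10_000))  # illustrative source

# Option 1: keep only samples whose index is below the cut-off, discard the rest.
train_dp = (
    dp.enumerate()
    .filter(lambda pair: pair[0] < 8_000)
    .map(lambda pair: pair[1])  # drop the index again
)

# Option 2: route every sample into one of two DataPipes based on its index.
# Elements are still (index, sample) pairs here; demux buffers elements for
# the branch that is not currently being consumed.
train_pairs, eval_pairs = dp.enumerate().demux(
    num_instances=2,
    classifier_fn=lambda pair: 0 if pair[0] < 8_000 else 1,
)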
In fact, it is not advisable to split the dataset with demux. Because the training, validation, and test sets are often very large, it is easy to exceed the buffer_size, and if the buffer_size is set too large, it defeats the purpose of an IterDataPipe and puts pressure on memory.
I second what @ezeli said. Using demux, I often exceed the buffer_size on my datasets. Furthermore, it's not always straightforward to define a classifier function that splits datasets according to one's needs.
The proposed IndexSpliterIterDataPipe from @ezeli works flawlessly for my use cases.
It would be more convenient to have a similar DataPipe directly available in the library.
> Because the training, validation, and test sets are often very large, it is easy to exceed the buffer_size, and if the buffer_size is set too large, it defeats the purpose of an IterDataPipe and puts pressure on memory.
Sharing my 2 cents here.
Your splitting strategy relies on skipping every element with an index smaller than self.start_idx. As you mentioned, the sub-sets are large, which means you have to skip a lot of elements; that is time-consuming and wastes all the operations applied to them.
Besides, demux provides a way to split a dataset based on more than just a range of indices (see the sketch below). We might consider adding a way to cache the buffer to local files to prevent memory from blowing up.
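As an illustration of splitting on something other than an index range, here is a hedged sketch of a content-based classifier for demux; the dict samples, the "id" key, and the 90/10 hash boundary are assumptions for this example only:

import hashlib

from torchdata.datapipes.iter import IterableWrapper

def hash_classifier(sample):
    # Deterministically send ~90% of ids to branch 0 (train) and the rest
    # to branch 1 (eval), independent of the order in which samples arrive.
    digest = hashlib.md5(str(sample["id"]).encode()).hexdigest()
    return 0 if int(digest, 16) % 10 < 9 else 1

dp = IterableWrapper([{"id": i, "x": i * i} for i in range(1_000)])  # made-up data
train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=hash_classifier)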
@ejguan It does take some time to skip elements; a little trick is to put the small datasets in front, such as the validation and test sets, and the training set at the back, so that it consumes less time.
Caching to a local file is a good idea, but if the dataset is split randomly, traversing one of the splits often means caching all of the others. For example, when traversing the training set, most of the validation and test sets get cached because they are randomly distributed. Going through the validation set first would be a disaster, since most of the training set would need to be cached.
In that case, I might as well split the train, validation, and test sets into three files ahead of time.
> a little trick is to put the small datasets in front, such as the validation and test sets, and the training set at the back, so that it consumes less time.
If that's the case, it's even easier and more efficient if you can split them into different files or archives.
> if the dataset is split randomly, traversing one of the splits often means caching all of the others.
I think @NivekT's PR has provided our initial support for it. Another alternative would be creating a non-buffer demux for you.
Should we close this issue now that https://github.com/pytorch/data/pull/843 has landed? Or do you want a specific tutorial about splitting a DataPipe?
Yea let's close this. Thanks!