
Add Examples of Common Preprocessing Steps with IterDataPipe (such as splitting a data set into two)

NivekT opened this issue 1 year ago · 7 comments

📚 The doc issue

There are a few common steps that users often would like to perform while preprocessing data, such as splitting their dataset into train and eval sets. There is documentation in PyTorch Core about how to do these things with Dataset. We should add the same to our documentation, specifically for IterDataPipe, or link to PyTorch Core's documentation for reference where appropriate. This issue is driven by common questions we have received either in person or on the forum.

If we find that any functionality is missing for IterDataPipe, we should implement them.

NivekT avatar Aug 02 '22 23:08 NivekT

Is there a general method? I have currently implemented an IterDataPipe that splits the dataset by index:

@functional_datapipe("index_split")
class IndexSpliterIterDataPipe(IterDataPipe):
    def __init__(self, source_dp, start_idx, end_idx) -> None:
        super().__init__()
        self.source_dp = source_dp
        self.start_idx = start_idx
        self.end_idx = end_idx
        assert self.end_idx > self.start_idx

    def __iter__(self):
        source_data = copy.deepcopy(self.source_dp)
        source_data = iter(source_data)
        for _ in range(self.start_idx):
            next(source_data)
        for _ in range(self.start_idx, self.end_idx):
            yield next(source_data)

    def __len__(self):
        return self.end_idx - self.start_idx
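
A hypothetical usage sketch of the pipe above (the in-memory source and split points are just for illustration):

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10))
valid_dp = dp.index_split(0, 2)   # elements with indices [0, 2)
train_dp = dp.index_split(2, 10)  # elements with indices [2, 10)
print(list(valid_dp), list(train_dp))  # [0, 1] [2, 3, ..., 9]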

If the total length cannot be obtained (i.e., end_idx is unknown), the following IterDataPipe can be used:

@functional_datapipe("index_split")
class IndexSpliterIterDataPipe(IterDataPipe):
    def __init__(self, source_dp, start_idx=0, end_idx=-1) -> None:
        super().__init__()
        self.source_dp = source_dp
        self.start_idx = start_idx
        self.end_idx = end_idx
        assert self.end_idx == -1 or self.end_idx > self.start_idx

    def __iter__(self):
        source_data = copy.deepcopy(self.source_dp)
        source_data = iter(source_data)
        for _ in range(self.start_idx):
            next(source_data)
        if self.end_idx == -1:
            for d in source_data:
                yield d
        else:
            for _ in range(self.start_idx, self.end_idx):
                yield next(source_data)

ezeli avatar Aug 08 '22 03:08 ezeli

@ezeli I think having a custom DataPipe seems fine for your use case. I am open to adding that to the library if more users have a use case for it.

We currently have .header(limit=10), which can yield samples from the start up to the specified limit, but not an arbitrary [start, end) range like the one you'd like.
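
For example, a tiny sketch with an in-memory source:

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(100)).header(limit=10)
print(list(dp))  # the first 10 elements: [0, 1, ..., 9]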

If you prefer to use built-in DataPipes instead, you can use one of the following (a short sketch of both is shown after the list):

  • dp = dp.enumerate().filter(filter_fn) - if you want to discard every sample outside of the index
  • dp1, dp2 = dp.enumerate().demux(classifier_fn) - if you want to split samples into two DataPipes
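
A minimal sketch of both options, with a small in-memory source and a hypothetical split_idx boundary:

from torchdata.datapipes.iter import IterableWrapper

source_dp = IterableWrapper(range(10))
split_idx = 6  # hypothetical boundary between the two subsets

# Option 1: keep only the samples whose index falls inside the desired range
first_dp = (
    source_dp.enumerate()
    .filter(filter_fn=lambda pair: pair[0] < split_idx)
    .map(lambda pair: pair[1])  # drop the index again
)

# Option 2: split into two DataPipes based on the index
dp1, dp2 = source_dp.enumerate().demux(
    num_instances=2,
    classifier_fn=lambda pair: 0 if pair[0] < split_idx else 1,
)
dp1 = dp1.map(lambda pair: pair[1])
dp2 = dp2.map(lambda pair: pair[1])

print(list(first_dp))            # [0, 1, 2, 3, 4, 5]
print(list(dp1), list(dp2))      # [0, 1, 2, 3, 4, 5] [6, 7, 8, 9]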

NivekT avatar Aug 08 '22 16:08 NivekT

In fact, splitting the dataset with demux is not advisable. Because the training, validation, and test sets are often very large, it is easy to exceed the buffer_size, and setting buffer_size very large defeats the purpose of an IterDataPipe and puts pressure on memory.

ezeli avatar Aug 09 '22 03:08 ezeli

I second what @ezeli said: using demux, I often exceed the buffer_size on my datasets. Furthermore, it's not always straightforward to define a classifier function to split datasets according to one's needs.

The proposed IndexSpliterIterDataPipe from @ezeli works flawlessly for my use cases. It would be more convenient to have a similar DataPipe directly available in the library.

vincentFrancais avatar Aug 09 '22 12:08 vincentFrancais

Because the training, validation, and test sets are often very large, it is easy to exceed the buffer_size, and setting buffer_size very large defeats the purpose of an IterDataPipe and puts pressure on memory.

Sharing my 2 cents here.

Your splitting strategy relies on skipping the elements with indices smaller than self.start_idx. Since, as you mentioned, the subsets can be very large, this means skipping a large number of elements, which is time-consuming and wastes all the work done to produce them.

Besides, demux provides a way to split a dataset based on more than just a range of indices. We might consider adding a way to cache the buffer to local files to prevent memory blowup.

ejguan avatar Aug 09 '22 14:08 ejguan

@ejguan It does take some time to skip elements. A little trick is to put the smaller datasets, such as the validation and test sets, at the front and the training set at the back, so less time is wasted.

Caching via a local file is a good idea, but if the dataset is split randomly, traversing one of the datasets often means caching all the other datasets. For example, when traversing the training set, most of the validation and test sets get cached because they are randomly distributed. Going through the validation set first would be a disaster, meaning most of the training set would need to be cached.

If so, I might as well split the train, validation and test sets into three files ahead of time.

ezeli avatar Aug 10 '22 12:08 ezeli

A little trick is to put the smaller datasets, such as the validation and test sets, at the front and the training set at the back, so less time is wasted.

If that's the case, it's even easier and more efficient if you can split them into different files or archives.

if the dataset is split randomly, traversing one of the datasets often means caching all the other datasets.

I think @NivekT's PR has provided our initial support for it. Another alternative would be to create a non-buffering demux for you.

ejguan avatar Aug 10 '22 19:08 ejguan

Should we close this issue, since https://github.com/pytorch/data/pull/843 has landed? Or do you want a specific tutorial about splitting a DataPipe?
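
If we do write a tutorial, a minimal sketch of what it could show using the built-in random_split functional (assuming a torchdata version where it is available; the numbers are only illustrative):

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10))
# Split the pipe into two parts according to the given weights
train_dp, valid_dp = dp.random_split(
    total_length=10, weights={"train": 0.5, "valid": 0.5}, seed=0
)
print(len(list(train_dp)), len(list(valid_dp)))  # roughly half of the samples in each split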

ejguan avatar Oct 20 '22 15:10 ejguan

Yea let's close this. Thanks!

NivekT avatar Oct 20 '22 17:10 NivekT