Add bucket sampler
🚀 Feature
Motivation
The legacy BucketIterator was convenient because it could batch samples by length to minimize padding. However, it had many disadvantages due to its API and its non-conformance with other parts of the pytorch data{sets,loader} ecosystem. It would be nice if torchtext supported the spirit of the BucketIterator by way of a Sampler.
Pitch
A sampler with the ability to specify a maximum bucket size should be added, similar to those in torchnlp and allennlp. It could be used with existing torchtext datasets by passing it as a kwarg to the pytorch DataLoader so that sampling minimizes padding.
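For illustration, a minimal sketch of how such a sampler might look and plug into a DataLoader (BucketBatchSampler, bucket_size_multiplier, and pad_collate are hypothetical names, loosely in the spirit of torchnlp's sampler, not a concrete proposal):

```python
import random
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Hypothetical sketch: shuffle indices, split them into large buckets,
    sort each bucket by length, and cut it into batches of similar length."""

    def __init__(self, lengths, batch_size, bucket_size_multiplier=100, shuffle=True):
        self.lengths = lengths            # per-sample lengths, known up front (map-style dataset)
        self.batch_size = batch_size
        self.bucket_size = batch_size * bucket_size_multiplier
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(len(self.lengths)))
        if self.shuffle:
            random.shuffle(indices)
        batches = []
        for start in range(0, len(indices), self.bucket_size):
            bucket = sorted(indices[start:start + self.bucket_size],
                            key=lambda i: self.lengths[i])
            batches.extend(bucket[i:i + self.batch_size]
                           for i in range(0, len(bucket), self.batch_size))
        if self.shuffle:
            random.shuffle(batches)       # shuffle batch order, not batch contents
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

# Usage (pad_collate would pad each batch to its own max length):
# loader = torch.utils.data.DataLoader(
#     dataset,
#     batch_sampler=BucketBatchSampler(lengths, batch_size=32),
#     collate_fn=pad_collate,
# )
```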
Alternatives
Users who want this functionality need to implement their own samplers.
Additional context
The migration guide contains a prototype of this feature without it being a first-class part of the torchtext repo. A proposed implementation can be found here.
Thanks @erip for raising this issue. This indeed is one of the important issues to be resolved and we understand that many users rely on this feature. We are currently working on updating torchtext datasets to build on torchdata DataPipes. Fortunately, the equivalent functionality is already implemented here, hence users should be able to sample batches with similar lengths out-of-the-box.
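For reference, out-of-the-box usage would look roughly like the sketch below; the bucketbatch parameter names here are my best recollection of the torchdata API and should be treated as approximate rather than authoritative.

```python
from torchdata.datapipes.iter import IterableWrapper

# Toy pipe of token sequences; in practice this would be a torchtext dataset pipe.
pipe = IterableWrapper([[0] * n for n in (5, 40, 7, 38, 6, 41, 8, 39)])

# Group nearby-length samples before batching. Parameter names below are
# approximate -- check the torchdata docs for the exact signature.
batched = pipe.bucketbatch(
    batch_size=2,
    batch_num=100,                                    # batches drawn per bucket
    bucket_num=1,                                     # buckets kept in the pool
    sort_key=lambda bucket: sorted(bucket, key=len),  # sort a bucket by length
)

for batch in batched:
    print([len(seq) for seq in batch])
```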
We will add a tutorial to demonstrate this functionality. Let me know if you would be interested in making this contribution and I can help with the code pointers :)
I'd definitely be interested in contributing!
As an aside, I've also got a max token batch sampler which is a bit different, but may be of interest. Not sure if it makes sense to include it in torchtext, but if you think it's worth a discussion I'd be happy to contribute it. The one hitch is that it (like all other samplers) requires map-style datasets.
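For context, the idea is roughly the following -- a simplified sketch of max-token batching, not the gist's actual code, and max_token_batches is just an illustrative name:

```python
def max_token_batches(lengths, max_tokens):
    """Illustrative sketch: pack length-sorted indices into batches whose
    padded size (num_samples * longest_in_batch) stays under max_tokens.
    Needs a map-style dataset so all lengths are known up front."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batch, longest = [], 0
    for i in order:
        longest = max(longest, lengths[i])
        if batch and (len(batch) + 1) * longest > max_tokens:
            yield batch
            batch, longest = [], lengths[i]
        batch.append(i)
    if batch:
        yield batch
```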
Hi @erip, in torchaudio I implemented a BucketizeSampler that has max_token_count as the argument. I'm working with torchdata developers to add such support to the bucketize DataPipe. I'm happy to hear your suggestions on the API design so that this bucketize DataPipe can be useful for both NLP and audio domains :)
@nateanl thanks for the pointer! I think it looks pretty good, but a couple of questions:
- it looks like minibatch shuffling happens unconditionally. Does it make sense to add a shuffle kwarg to the sampler?
- I didn't realize Datasets had a len_list attr and I can't find that in the pytorch source. Is that populated somewhere in torchaudio?
- one neat trick to promote diversity in batches is to add some small random noise to the lengths of samples so you don't see the same examples each epoch. My gist has an impl of that -- WDYT about adding an optional feature for that in this notional upstream sampler?
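The noise trick amounts to something like this (a sketch of the idea, not the gist verbatim):

```python
import random

def noisy_sort(lengths, noise=0.1):
    """Sort indices by length plus a small multiplicative jitter so that
    similar-length samples land in slightly different batches each epoch."""
    return sorted(
        range(len(lengths)),
        key=lambda i: lengths[i] * (1.0 + random.uniform(-noise, noise)),
    )
```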
The len_list is in the custom HuBERT dataset. I think your impl is better -- we should assume the dataset is unsorted and sort the indices by passing in the unsorted lengths list.
I'm not sure shuffling the samples within a batch will improve training -- will the gradient differ by shuffling? What I did is shuffle the samples within the same bucket so that you get different samples every epoch. I don't shuffle the order of the buckets because that may introduce a large length gap within the same batch when transitioning to a new bucket.
In terms of creating the buckets, I use num_buckets as the argument and set the boundaries by interval = (max_len - min_len) // num_buckets. There are other ways to do it, for example, taking boundaries as the argument. Which one do you think is better to use?
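To make that concrete, here is a sketch of the num_buckets approach as I understand it (variable names are made up; this is not the torchaudio code):

```python
import random

def make_buckets(lengths, num_buckets, batch_size):
    """Sketch (made-up names): assign indices to buckets by length interval,
    reshuffle within each bucket every epoch, and keep the bucket order fixed
    so a batch never mixes wildly different lengths."""
    min_len, max_len = min(lengths), max(lengths)
    interval = max(1, (max_len - min_len) // num_buckets)
    buckets = {}
    for idx, length in enumerate(lengths):
        k = min((length - min_len) // interval, num_buckets - 1)
        buckets.setdefault(k, []).append(idx)

    batches = []
    for k in sorted(buckets):           # bucket order is not shuffled
        random.shuffle(buckets[k])      # samples within a bucket are
        bucket = buckets[k]             # reshuffled on every call/epoch
        batches.extend(bucket[i:i + batch_size]
                       for i in range(0, len(bucket), batch_size))
    return batches
```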
I'm not sure shuffling the samples within a batch will improve training
I'm also not sure of this, but it seems to be happening in the torchaudio code. It might be worth a small test -- something tells me having unshuffled examples might lead to overfitting, but that's just a hunch. In any case, it might be unexpected to shuffle examples within a batch, so having an optional arg might be OK if the shuffling is kept.
Which one do you think is better to use?
I think your approach is quite elegant. One idea might be to make max_len and min_len optional arguments so you can filter the data to avoid highly skewed length distributions, but it's not obvious whether this filtering should happen in the dataset code or in some preprocessing step. If they're not provided, you can infer them as in your sampler. In MT we often set min and max lengths for subword-tokenized sentences, but that could plausibly happen at the time of training the subword model. Not sure what the best answer is.
random.shuffle(buckets[k])
Yeah, this shuffling will generate different batches (assuming the bucket size is greater than batch_size), but shuffling the samples within the same batch may not change the gradient, what do you think ;)
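A quick toy check of that intuition -- with a mean-reduced loss, permuting the samples inside a batch leaves the gradient unchanged:

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)
y = torch.randn(8)

# Gradient from the batch in its original order.
torch.nn.functional.mse_loss(x @ w, y).backward()
g1 = w.grad.clone()

# Gradient from the same batch with its samples permuted.
w.grad = None
perm = torch.randperm(8)
torch.nn.functional.mse_loss(x[perm] @ w, y[perm]).backward()
g2 = w.grad.clone()

print(torch.allclose(g1, g2))  # expect True
```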
max_len and min_len sound good to me. Does it mean the samples whose length is < min_len or > max_len will be filtered out? If so, we can add such filtering to the new DataPipes code.
Yeah this shuffling will generate different batches
Ah, I misread the code -- this makes total sense. I agree that this will have more impact than shuffling within a batch. I still think it's good to give the user an option for whether there should be any shuffling at all -- maybe just shuffle: bool = True by default?
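Something along these lines, as a strawman (function and argument names are placeholders):

```python
import random
from typing import Dict, Iterator, List

def iter_batches(buckets: Dict[int, List[int]], batch_size: int,
                 shuffle: bool = True) -> Iterator[List[int]]:
    """Strawman: same bucketing as above, with an explicit opt-out of the
    per-epoch within-bucket shuffle."""
    for k in sorted(buckets):
        bucket = list(buckets[k])
        if shuffle:
            random.shuffle(bucket)
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]
```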
Does it mean the samples whose length
Yes, exactly. See here. There's probably some error checking to add here (what if lengths is the empty list after filtering? 🙀 ), but otherwise seems OK.
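Something like this placeholder sketch, with made-up names:

```python
def filtered_lengths(lengths, min_len=None, max_len=None):
    """Placeholder sketch: infer bounds when not given, drop out-of-range
    samples, and fail loudly if nothing survives the filter."""
    lo = min(lengths) if min_len is None else min_len
    hi = max(lengths) if max_len is None else max_len
    kept = [(i, n) for i, n in enumerate(lengths) if lo <= n <= hi]
    if not kept:
        raise ValueError(
            f"no samples left after filtering with min_len={lo}, max_len={hi}"
        )
    return kept
```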