pytorch-meta

How to process the whole batch

Open minhtriet opened this issue 4 years ago • 1 comment

Currently I am relying on __getitem__(self, index) to tokenize the sentence at index. However, it would be more efficient to tokenize the whole batch at once instead of each sentence individually. Could this be done in pytorch-meta? I have not yet found a way to do it in the examples.

minhtriet, Nov 12 '20 21:11

Most datasets currently available in Torchmeta rely on a hierarchy of three objects:

  • Dataset, a plain PyTorch dataset responsible for returning the individual examples of a given label. For example, it can be a dataset containing all (20) examples of the letter A in Omniglot.
  • ClassDataset, which produces the datasets for the different classes. Each index of this class corresponds to a single label. For example, in Omniglot this contains 1028 elements, and class_dataset[0] returns an instance of Dataset (above) containing all the examples of images_background/Alphabet_of_the_Magi/character01.
  • CombinationMetaDataset, which combines multiple indices (for example (0, 1, 2, 3, 4)) to create a task over the corresponding labels; the individual indices are the ones used by ClassDataset above.
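
To make the hierarchy concrete, here is a minimal sketch in plain Python. The class names (ToyDataset, ToyClassDataset, ToyCombinationDataset) and the toy data are hypothetical stand-ins, not Torchmeta's actual classes; they only mimic the indexing behaviour described above.

```python
# Toy stand-ins for the three levels. These are NOT Torchmeta's classes;
# they only illustrate how indexing flows from tasks to classes to examples.


class ToyDataset:
    """All examples for a single label (map-style dataset protocol)."""

    def __init__(self, label, examples):
        self.label = label
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        return self.examples[index], self.label


class ToyClassDataset:
    """One entry per label; indexing returns the dataset for that label."""

    def __init__(self, examples_by_label):
        self.labels = sorted(examples_by_label)
        self.examples_by_label = examples_by_label

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, class_index):
        label = self.labels[class_index]
        return ToyDataset(label, self.examples_by_label[label])


class ToyCombinationDataset:
    """Indexing with a tuple of class indices yields one task."""

    def __init__(self, class_dataset):
        self.class_dataset = class_dataset

    def __getitem__(self, class_indices):
        # A task is represented here as the list of per-class datasets.
        return [self.class_dataset[i] for i in class_indices]


data = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2", "c3"]}
class_dataset = ToyClassDataset(data)
task = ToyCombinationDataset(class_dataset)[(0, 2)]
task_sizes = [len(d) for d in task]
```

In this sketch, the task built from indices (0, 2) holds the per-class datasets for labels "A" and "C".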

Something you could do in your case is to tokenize all the elements of Dataset at once, because this is essentially a batch of data (the sampler later samples from it to create the actual datasets for the task).
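
A rough sketch of that first option, assuming a per-class dataset of sentences: the tokenizer is called once on all sentences of the class when the dataset is built, and __getitem__ only looks up the result. Both batch_tokenize and TokenizedSentenceDataset are hypothetical names for illustration; a real tokenizer with a batched interface would take the place of batch_tokenize.

```python
def batch_tokenize(sentences):
    """Hypothetical batch tokenizer: lowercases and splits on whitespace."""
    return [sentence.lower().split() for sentence in sentences]


class TokenizedSentenceDataset:
    """Per-class dataset that tokenizes every sentence in one call.

    In practice this would subclass the PyTorch (or Torchmeta) Dataset;
    plain Python is used here to keep the sketch self-contained.
    """

    def __init__(self, label, sentences, tokenizer=batch_tokenize):
        self.label = label
        # One tokenizer call for the whole class, instead of one per index.
        self.tokens = tokenizer(list(sentences))

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index):
        # Tokenization already happened; this is only a lookup.
        return self.tokens[index], self.label


dataset = TokenizedSentenceDataset(
    "greeting", ["Hello there", "Good morning to you"]
)
```

The trade-off is that all tokens for a class are held in memory once the dataset is created, which is usually acceptable for the small per-class datasets used in few-shot tasks.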

Another option could be to look into how to allow __getitem__(index) to receive a batch (list) of indices as index. This is already possible in standard PyTorch datasets, and since Torchmeta datasets are essentially instances of PyTorch datasets, it could be possible here too. I tried to include that in Torchmeta at some point to improve sampling, but there was no particular improvement for image datasets, especially since Torchvision transforms (e.g. image loading, Resize, etc.) only accept single images, so I did not pursue it further.
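
As an illustration of the second option, here is a minimal sketch of a dataset whose __getitem__ accepts either a single index or a list of indices. CountingTokenizer and BatchIndexableDataset are hypothetical names, and the sketch is plain Python rather than Torchmeta code; it only shows that a list of indices can be served with a single tokenizer call.

```python
class CountingTokenizer:
    """Hypothetical batch tokenizer that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, sentences):
        self.calls += 1
        return [sentence.lower().split() for sentence in sentences]


class BatchIndexableDataset:
    """Map-style dataset whose __getitem__ accepts an int or a list of ints."""

    def __init__(self, sentences, tokenizer):
        self.sentences = list(sentences)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        if isinstance(index, int):
            # Single index: tokenize a batch of one and unwrap it.
            return self.tokenizer([self.sentences[index]])[0]
        # List of indices: tokenize the whole batch in a single call.
        return self.tokenizer([self.sentences[i] for i in index])


tokenizer = CountingTokenizer()
dataset = BatchIndexableDataset(["One two", "Three", "Four five six"], tokenizer)
batch = dataset[[0, 2]]  # one tokenizer call for two sentences
single = dataset[1]      # one tokenizer call for one sentence
```

With a standard PyTorch DataLoader, one way to feed lists of indices to such a dataset is to disable automatic batching (batch_size=None) and pass a BatchSampler as the sampler. Whether Torchmeta's own samplers and task datasets can be driven this way is not confirmed here.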

tristandeleu, Nov 13 '20 12:11