How to process the whole batch
Currently I am relying on `__getitem__(self, index)` to tokenize sentence `index`. However, it would be more efficient to tokenize the whole batch instead of individual sentences. Is there a way to do this in pytorch-meta? I have not yet found one in the examples.
Most datasets currently available in Torchmeta rely on a hierarchy of three objects (illustrated in the sketch after this list):

- `Dataset`, which is simply a PyTorch dataset, responsible for getting the individual examples for a given label. For example, it can be a dataset containing all the (20) examples of the letter A in Omniglot.
- `ClassDataset`, which produces the datasets for the different classes. Each index of this class corresponds to a single label. For example, in Omniglot this contains 1028 elements, and `class_dataset[0]` returns an instance of `Dataset` (above) containing all the examples of `images_background/Alphabet_of_the_Magi/character01`.
- `CombinationMetaDataset`, which combines multiple indices (for example `(0, 1, 2, 3, 4)`) to create a task over the corresponding labels, the individual indices corresponding to the ones in `ClassDataset` above.
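As a rough illustration of how these three levels fit together, here is a minimal sketch using Torchmeta's Omniglot dataset, which plays the role of the `CombinationMetaDataset`: indexing it with a tuple of class indices builds a task over those classes. The `'data'` root path and the 5-way setup are placeholders, not part of the original answer.

```python
# Minimal sketch, assuming torchmeta is installed and Omniglot can be
# downloaded under 'data'.
from torchmeta.datasets import Omniglot
from torchmeta.transforms import Categorical

# Omniglot here is a CombinationMetaDataset over 1028 training classes.
meta_dataset = Omniglot('data',
                        num_classes_per_task=5,
                        meta_train=True,
                        target_transform=Categorical(num_classes=5),
                        download=True)

# Each index in the tuple corresponds to one label of the underlying
# ClassDataset; the resulting task contains all examples of those 5 classes.
task = meta_dataset[(0, 1, 2, 3, 4)]
print(len(task))  # total number of examples across the 5 selected classes
```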
Something you could do in your case is to tokenize all the elements of `Dataset` at once, because this is essentially a batch of data (from which the sampler is going to sample to create the actual datasets for the task).
Another option could be to look into how to allow `__getitem__(index)` to get a batch (list) of indices for `index`. This is already possible in standard PyTorch datasets, and since Torchmeta datasets are essentially instances of PyTorch datasets, this could work here too. I tried to include that in Torchmeta at some point to improve sampling, but there was no particular improvement for image datasets, especially since Torchvision transforms (e.g. image loading, `Resize`, etc.) only accept single images, so I ended up not pursuing it further.
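For reference, a dataset whose `__getitem__` accepts either a single index or a list of indices might be sketched as follows. Again, `sentences` and `tokenizer` are placeholders, and this is not part of the Torchmeta API:

```python
# Sketch of a __getitem__ that handles both a single index and a list of
# indices, tokenizing the whole batch in one call when given a list.
from torch.utils.data import Dataset


class BatchedGetItem(Dataset):
    def __init__(self, sentences, tokenizer):
        self.sentences = sentences
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        if isinstance(index, (list, tuple)):
            # Batch of indices: tokenize all selected sentences at once.
            batch = [self.sentences[i] for i in index]
            return self.tokenizer(batch, padding=True, return_tensors='pt')
        # Single index: fall back to per-sentence tokenization.
        return self.tokenizer(self.sentences[index], return_tensors='pt')
```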