__getitem__() not implemented?
❓ Questions and Help
Description
For some reason, calling __getitem__() on the torchtext Multi30k dataset raises a NotImplementedError for me, even though the dataset is properly downloaded and calling next(iter()) on it produces valid output. Can someone help me understand this? I need the method because I'm wrapping the dataset in a larger dataset class and will have to call __getitem__() explicitly to perform joint pre-processing with other dataset products.
Sample
m30k = torchtext.datasets.Multi30k(root='./Data', split='test', language_pair=('en', 'de'))
m30k.__getitem__(0)  # raises NotImplementedError
Multi30k (and, indeed, every torchtext dataset) is iterable-style and therefore does not implement __getitem__. You can convert it to a map-style dataset (which does implement __getitem__) using torchtext.data.functional.to_map_style_dataset:
>>> import torchtext
>>> m30k = torchtext.datasets.Multi30k(root='./Data', split='test', language_pair=('en', 'de'))
>>> map_m30k = torchtext.data.functional.to_map_style_dataset(m30k)
>>> map_m30k[0]
('A man in an orange hat starring at something.\n', 'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.\n')
Thanks @erip for the reply. Please note that this is experimental functionality (https://github.com/pytorch/text/blob/a2ab9741415b2cff026d158a5a54b62b993571d9/torchtext/data/functional.py#L15). Also, in light of the migration (#1494), I wonder if there is a better way to do this such that we do not lose the datapipe properties. @ejguan Is there provision to support __getitem__ for datapipes (even if it means materializing the whole dataset)?
It's doable using MapDataPipe, but it's a different concept. It's also currently a second-class citizen in TorchData, as we recommend IterDataPipe for its streaming support, especially for large datasets.
- https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes/map
And there are ways to convert from MapDataPipe to IterDataPipe, or to make the two types of DataPipes work together, as in https://github.com/pytorch/data/blob/24c25c030e1fce6c75c41b18c837c049a14410f1/torchdata/datapipes/iter/util/combining.py#L90.
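As a minimal sketch of the materializing route mentioned above, you can collect an iterable pipe into a list and wrap it with SequenceWrapper from core torch datapipes (the toy IterableWrapper input here is just a stand-in for a real pipeline):

from torch.utils.data.datapipes.iter import IterableWrapper
from torch.utils.data.datapipes.map import SequenceWrapper

iter_dp = IterableWrapper([("hello", "hallo"), ("world", "welt")])  # toy stand-in
map_dp = SequenceWrapper(list(iter_dp))  # materializes the whole pipe in memory

map_dp[0]    # ('hello', 'hallo') -- __getitem__ now works
len(map_dp)  # 2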
Tbh, we have no plan to add __getitem__ to IterDataPipe, as __getitem__ doesn't fit the streaming model: it would require materializing the data or objects for each index within the DataPipe instance.
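For intuition, here is a toy sketch of what serving an index from a pure stream looks like without materialization: every lookup has to replay the stream from the start, O(n) per access (the only alternative is to hold everything in memory, which is roughly what to_map_style_dataset does):

from itertools import islice

# Toy sketch: __getitem__ over a stream means replaying it from the
# beginning for every index lookup.
def naive_getitem(make_stream, idx):
    # make_stream() must return a fresh iterator over the underlying data
    return next(islice(make_stream(), idx, idx + 1))

naive_getitem(lambda: iter(["a", "b", "c"]), 2)  # 'c', after skipping indices 0..1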
Can I understand the reasoning behind implementing torchtext datasets as iterable-style instead of map-style? Many significantly larger image datasets (such as ImageNet and CIFAR-10) are implemented as map-style in torchvision (indeed, loading the entire dataset into memory is not a requirement of map-style anyway), and I'm not sure why batch size would be element-dependent in this case. Those are really the only two cases where convention seems to dictate that an iterable-style dataset be used.
Batch size can certainly be element-dependent in NLP, where you may want to form batches based on the lengths of the examples (like max-token post-pad batching).
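As a hypothetical sketch of what that looks like (the 64-token budget and whitespace tokenization here are placeholder assumptions), the number of examples per batch falls out of the examples' lengths, so it cannot be fixed up front:

# Hypothetical sketch: group sentence pairs under a running token budget,
# so batch size varies with the elements themselves.
def max_token_batches(pairs, max_tokens=64):
    batch, budget = [], 0
    for src, tgt in pairs:
        n = max(len(src.split()), len(tgt.split()))  # crude token count
        if batch and budget + n > max_tokens:
            yield batch
            batch, budget = [], 0
        batch.append((src, tgt))
        budget += n
    if batch:
        yield batch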
Some datasets in torchtext are modestly sized, but others (like CC100, coming soon) are significantly larger, and iterable-style is the only realistic way to consume them. Additionally, datapipes in the PyTorch ecosystem prefer iterable-style, which enables slightly cleaner, more intent-revealing semantics at the dataset level (vs. at the loader level).
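To illustrate that last point, with datapipes the shuffle/transform/batch intent lives on the dataset pipeline itself rather than in DataLoader arguments (a sketch using core datapipes; the toy data and lowercasing transform are placeholders):

from torch.utils.data.datapipes.iter import IterableWrapper

pipe = (
    IterableWrapper([("Hello", "Hallo"), ("World", "Welt")])  # toy data
    .shuffle()                                   # expressed on the dataset...
    .map(lambda pair: tuple(s.lower() for s in pair))
    .batch(2)                                    # ...not as loader arguments
)
for batch in pipe:
    print(batch)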