Add transforms kwarg to Datasets
🚀 Feature
torchtext datasets should provide an optional transforms kwarg.
Motivation
Other domain libraries provide transform and target_transform kwargs on their datasets for common operations (e.g., resizing, scaling, and cropping images in torchvision, or numericalizing the associated labels). torchtext should provide similar kwargs to support transforms for tokenization, numericalization, padding, etc. Currently these operations must happen inside the collate_fn (or similar), which is awkward because it gives the collate_fn too many responsibilities.
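To illustrate the status quo described above, here is a hedged sketch of a collate_fn that has to handle tokenization, numericalization, and padding all at once. The vocab, pad index, and whitespace tokenizer are illustrative placeholders, not torchtext APIs:

```python
# Sketch of the "too many responsibilities" collate_fn this issue describes.
# VOCAB and PAD_IDX are made up for the example.
PAD_IDX = 0
VOCAB = {"<pad>": 0, "hello": 1, "world": 2, "hi": 3}

def collate_fn(batch):
    # batch: list of (label, raw_text) pairs
    # 1) tokenize + numericalize
    tokenized = [[VOCAB[tok] for tok in text.split()] for _, text in batch]
    # 2) pad to the longest sequence in the batch
    max_len = max(len(seq) for seq in tokenized)
    padded = [seq + [PAD_IDX] * (max_len - len(seq)) for seq in tokenized]
    labels = [label for label, _ in batch]
    return labels, padded
```

With a transforms kwarg, steps 1 would move into the dataset and collate_fn could focus on batching and padding alone.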
Pitch
If provided, transform and target_transform should be applied in the __getitem__ (map-style) or __iter__ (iterable-style) of the Datasets defined within torchtext. Suitable transforms already exist in both the stable and experimental APIs.
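A minimal sketch of the pitch, assuming a map-style dataset; the class and its data layout are hypothetical, not the actual torchtext implementation:

```python
# Hedged sketch: a dataset that applies optional transform /
# target_transform inside __getitem__, mirroring torchvision's pattern.
class TextClassificationDataset:
    def __init__(self, data, transform=None, target_transform=None):
        # data: list of (label, raw_text) pairs
        self.data = data
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label, text = self.data[idx]
        if self.transform is not None:
            text = self.transform(text)
        if self.target_transform is not None:
            label = self.target_transform(label)
        return label, text


# Illustrative usage with a toy vocab and label map.
vocab = {"hello": 0, "world": 1}
ds = TextClassificationDataset(
    [("pos", "hello world")],
    transform=lambda t: [vocab[w] for w in t.split()],
    target_transform={"pos": 1, "neg": 0}.get,
)
```

Here ds[0] yields a numericalized sample without any work in the collate_fn.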
Alternatives
Status quo -- don't add the kwargs.
Additional context
N/A
It seems like once torchdata adoption lands we could just apply a map datapipe to transform the data. I'll keep this open until we've cut over to the datapipes API and torchdata.
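The map-datapipe idea above can be sketched without torchdata installed; the class below is a simplified stand-in for a Mapper datapipe (dp.map(fn)), and the tokenizer is a placeholder:

```python
# Hedged stand-in for torchdata's map datapipe: wraps a source iterable
# and applies fn to each item on iteration, as dp.map(fn) would.
class MapDataPipe:
    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)


def simple_tokenize(line):
    # Placeholder transform; real pipelines would tokenize/numericalize here.
    return line.lower().split()


raw = ["The quick brown fox", "Hello World"]
pipe = MapDataPipe(raw, simple_tokenize)
```

Iterating pipe yields the transformed samples, so the transform lives in the pipeline rather than in a dataset kwarg.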
That's right! I am working on a tutorial PR (https://github.com/pytorch/text/pull/1468) that shows how this is done with datapipes. Feel free to share your feedback there :) https://github.com/pytorch/text/blob/ca48ff66e6af66fb14b4f99b02f04354bf48bbac/examples/tutorials/sst2_classification_non_distributed.py