Add transforms kwarg to Datasets

Open erip opened this issue 3 years ago • 2 comments

🚀 Feature

torchtext datasets should provide an optional transforms kwarg.

Motivation

Other domain libraries provide transform and target_transform kwargs on their datasets for common operations (e.g., torchvision uses them to resize, scale, and crop images and to numericalize the associated labels). torchtext should provide similar kwargs to support transforms for tokenization, numericalization, padding, etc. Currently this must happen within the collate_fn or similar, which is somewhat awkward because it gives the collate_fn too many responsibilities.
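To illustrate the status quo, here is a minimal sketch of a collate_fn that has to tokenize, numericalize, and pad all at once. The vocabulary, label map, and whitespace tokenizer are hypothetical stand-ins rather than torchtext APIs, and plain lists are used instead of tensors to keep the example dependency-free.

```python
# Hypothetical toy vocabulary and label map; not part of torchtext.
VOCAB = {"<pad>": 0, "<unk>": 1, "good": 2, "bad": 3, "movie": 4}
LABELS = {"neg": 0, "pos": 1}


def collate_fn(batch):
    """Tokenize, numericalize, and pad a batch of (text, label) pairs."""
    token_ids = []
    targets = []
    for text, label in batch:
        # Tokenization (a naive whitespace split, assumed for illustration).
        tokens = text.lower().split()
        # Numericalization of tokens and of the label.
        token_ids.append([VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens])
        targets.append(LABELS[label])
    # Padding to the longest sequence in the batch.
    max_len = max(len(ids) for ids in token_ids)
    padded = [ids + [VOCAB["<pad>"]] * (max_len - len(ids)) for ids in token_ids]
    return padded, targets


batch = [("Good movie", "pos"), ("bad", "neg")]
padded, targets = collate_fn(batch)
```

Only the padding step actually depends on the batch as a whole; the per-sample steps are what the proposed kwargs would take over.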

Pitch

If provided, transforms and target_transforms should be applied in the __getitem__ or __iter__ of Datasets as defined within torchtext. Suitable transforms already exist in torchtext, some as stable APIs and some as experimental ones.
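A rough sketch of what this could look like follows. The class name TransformedTextDataset is hypothetical, the kwarg names transform and target_transform are borrowed from the torchvision convention mentioned under Motivation, and the class is kept in plain Python here; an actual implementation would presumably subclass torch.utils.data.Dataset.

```python
class TransformedTextDataset:
    """Hypothetical map-style dataset of (text, label) pairs with optional transforms."""

    def __init__(self, samples, transform=None, target_transform=None):
        self._samples = list(samples)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        text, label = self._samples[idx]
        # Apply each transform only if the caller provided one.
        if self.transform is not None:
            text = self.transform(text)
        if self.target_transform is not None:
            label = self.target_transform(label)
        return text, label

    def __iter__(self):
        # Iteration reuses __getitem__, so both access paths see the transforms.
        for idx in range(len(self)):
            yield self[idx]


# Hypothetical toy vocabulary and label map, for illustration only.
vocab = {"<unk>": 1, "good": 2, "bad": 3, "movie": 4}
label_ids = {"neg": 0, "pos": 1}

dataset = TransformedTextDataset(
    [("Good movie", "pos"), ("bad", "neg")],
    transform=lambda text: [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()],
    target_transform=label_ids.__getitem__,
)
first_item = dataset[0]
all_items = list(dataset)
```

With per-sample work handled in the dataset, a collate_fn would only need to pad and batch the already numericalized samples.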

Alternatives

Status quo -- don't add the kwargs.

Additional context

N/A

erip avatar Jan 01 '22 21:01 erip

It seems like once torchdata adoption lands we could just apply a map datapipe to transform the data. I'll keep this open until we've cut over to the datapipes API and torchdata.

erip avatar Jan 07 '22 16:01 erip

That's right! I am working on a tutorial PR (https://github.com/pytorch/text/pull/1468) that shows how this is done with datapipes. Feel free to share your feedback there :) https://github.com/pytorch/text/blob/ca48ff66e6af66fb14b4f99b02f04354bf48bbac/examples/tutorials/sst2_classification_non_distributed.py

parmeet avatar Jan 07 '22 17:01 parmeet