Add transforms kwarg to Datasets
🚀 Feature
torchtext datasets should provide an optional transforms kwarg.
Motivation
Other domain libraries provide transform and target_transform kwargs on their datasets for common operations (e.g., resizing, scaling, and cropping images in torchvision, or numericalizing the associated labels). torchtext should provide similar kwargs to support transforms for tokenization, numericalization, padding, etc. Currently these operations must happen inside the collate_fn (or similar), which is awkward because it gives the collate_fn too many responsibilities.
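To illustrate the status quo described above, here is a hedged sketch of a collate_fn that has to handle tokenization, numericalization, and padding all at once. The vocab, pad index, and whitespace tokenizer are illustrative placeholders, not torchtext APIs:

```python
# Sketch of the "too many responsibilities" collate_fn this issue describes.
# VOCAB and PAD_IDX are made up for the example.
PAD_IDX = 0
VOCAB = {"<pad>": 0, "hello": 1, "world": 2, "hi": 3}

def collate_fn(batch):
    # batch: list of (label, raw_text) pairs
    # 1) tokenize + numericalize
    tokenized = [[VOCAB[tok] for tok in text.split()] for _, text in batch]
    # 2) pad to the longest sequence in the batch
    max_len = max(len(seq) for seq in tokenized)
    padded = [seq + [PAD_IDX] * (max_len - len(seq)) for seq in tokenized]
    labels = [label for label, _ in batch]
    return labels, padded
```

With a transforms kwarg, steps 1 would move into the dataset and collate_fn could focus on batching and padding alone.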
Pitch
If provided, transform and target_transform should be applied in the __getitem__ (map-style) or __iter__ (iterable-style) of the Datasets defined within torchtext. Suitable transforms already exist in both the stable and experimental APIs.
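A minimal sketch of the pitch, assuming a map-style dataset; the class and its data layout are hypothetical, not the actual torchtext implementation:

```python
# Hedged sketch: a dataset that applies optional transform /
# target_transform inside __getitem__, mirroring torchvision's pattern.
class TextClassificationDataset:
    def __init__(self, data, transform=None, target_transform=None):
        # data: list of (label, raw_text) pairs
        self.data = data
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label, text = self.data[idx]
        if self.transform is not None:
            text = self.transform(text)
        if self.target_transform is not None:
            label = self.target_transform(label)
        return label, text


# Illustrative usage with a toy vocab and label map.
vocab = {"hello": 0, "world": 1}
ds = TextClassificationDataset(
    [("pos", "hello world")],
    transform=lambda t: [vocab[w] for w in t.split()],
    target_transform={"pos": 1, "neg": 0}.get,
)
```

Here ds[0] yields a numericalized sample without any work in the collate_fn.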
Alternatives
Status quo -- don't add the kwargs.
Additional context
N/A
It seems like once torchdata adoption lands we could just apply a map datapipe to transform the data. I'll keep this open until we've cut over to the datapipes API and torchdata.
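The map-datapipe idea above can be sketched without torchdata installed; the class below is a simplified stand-in for a Mapper datapipe (dp.map(fn)), and the tokenizer is a placeholder:

```python
# Hedged stand-in for torchdata's map datapipe: wraps a source iterable
# and applies fn to each item on iteration, as dp.map(fn) would.
class MapDataPipe:
    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)


def simple_tokenize(line):
    # Placeholder transform; real pipelines would tokenize/numericalize here.
    return line.lower().split()


raw = ["The quick brown fox", "Hello World"]
pipe = MapDataPipe(raw, simple_tokenize)
```

Iterating pipe yields the transformed samples, so the transform lives in the pipeline rather than in a dataset kwarg.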
That's right! I am working on a tutorial PR (https://github.com/pytorch/text/pull/1468) that shows how this is done with datapipes. Feel free to share your feedback there :) https://github.com/pytorch/text/blob/ca48ff66e6af66fb14b4f99b02f04354bf48bbac/examples/tutorials/sst2_classification_non_distributed.py