Torchtext Dataset/Dataloader with generator

Open dipta007 opened this issue 3 years ago • 1 comments

❓ Questions and Help

Description

For a large corpus, I couldn't find a way to use an iterator in a torchtext dataset the way a PyTorch dataset allows. Is it possible to build a dataset from just a generator, or to implement something like a PyTorch dataset object that pulls the data dynamically?

dipta007 avatar Jun 12 '22 19:06 dipta007

Hi @dipta007, in torchtext 0.12 we migrated our datasets to build on top of torchdata. You can look at the dataset implementations, which offer plenty of examples, or refer to the torchdata documentation for more on usage and the functionality available in datapipes.

In general, datapipes let you construct iterable-style datasets and can be used with a large corpus. For instance, unlike map-style datasets, you do not have to read the whole dataset into memory to work with datapipes; they work in a streaming fashion.
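
As a rough illustration of the streaming idea, here is a minimal sketch that uses plain PyTorch's `torch.utils.data.IterableDataset` rather than a torchdata datapipe. The class name `LineStreamDataset` and the temporary corpus file are hypothetical, made up for the example; the point is only that `__iter__` can be an ordinary generator that reads one line at a time.

```python
import os
import tempfile

from torch.utils.data import DataLoader, IterableDataset


class LineStreamDataset(IterableDataset):
    """Hypothetical iterable-style dataset that yields one line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Generator body: lines are read lazily, so the corpus is never
        # loaded into memory as a whole.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line


# Tiny stand-in corpus (an assumption for the demo; a real corpus would be
# a large file on disk).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first sentence\nsecond sentence\nthird sentence\n")
    corpus_path = tmp.name

# The DataLoader pulls items from the generator and groups them into batches.
loader = DataLoader(LineStreamDataset(corpus_path), batch_size=2)
batches = [list(batch) for batch in loader]

os.remove(corpus_path)
```

Note that with `num_workers` greater than 0, each worker gets its own copy of an iterable-style dataset, so the stream has to be sharded per worker to avoid duplicated samples.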

parmeet avatar Jun 13 '22 15:06 parmeet