Add `LengthSetterIterDataPipe` to all torchtext datasets

Open Nayef211 opened this issue 3 years ago • 2 comments

🚀 Feature

We want to add the `LengthSetterIterDataPipe` to the end of all torchtext datasets. This will allow us to call `len()` on the datapipe object and prevent errors like `TypeError: DataPipe instance doesn't have valid length`.
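To illustrate the idea, here is a minimal pure-Python sketch of a length-setting wrapper. It is not the torchdata implementation: `LengthSetterSketch` and `StreamingSource` are hypothetical names made up for this example, and the sketch assumes the caller already knows the number of items in the dataset.

```python
from typing import Any, Iterable, Iterator


class StreamingSource:
    """Hypothetical iterable with no __len__, standing in for a streaming datapipe."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[Any]:
        for i in range(self.n):
            yield (f"src {i}", f"tgt {i}")


class LengthSetterSketch:
    """Hypothetical wrapper that attaches a caller-declared length to an iterable."""

    def __init__(self, source: Iterable[Any], length: int) -> None:
        if length < 0:
            raise ValueError("length must be non-negative")
        self.source = source
        self.length = length

    def __iter__(self) -> Iterator[Any]:
        # Items pass through lazily; nothing is materialized up front.
        yield from self.source

    def __len__(self) -> int:
        # Returns the declared length; it is not checked against the data.
        return self.length


source = StreamingSource(5)
try:
    len(source)  # the bare source has no __len__
    source_has_len = True
except TypeError:
    source_has_len = False

pipe = LengthSetterSketch(source, length=5)
wrapped_len = len(pipe)  # works without iterating over the data
```

Because the declared length is trusted rather than computed, a wrong value would go unnoticed in this sketch; each dataset would need to supply its own correct count.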

Motivation

See https://github.com/pytorch/tutorials/pull/1954#discussion_r993951194 for discussion.

Additional Context

Once this has been done for the Multi30k dataset, we can remove the conversion of the datapipe to a list in https://github.com/pytorch/tutorials/pull/1954 (i.e. `list(train_dataloader)`). That conversion materializes all data in the dataset, which can lead to OOMs for very large datasets.

Nayef211 avatar Oct 13 '22 02:10 Nayef211

Hello! We are a group of second-year engineering school students. We are interested in resolving this issue as a school project. Please let us know if we have your permission to contribute to this issue.

moDallel avatar Nov 25 '22 15:11 moDallel

@moDallel - We welcome contributions! Thanks for your interest in the project.

joecummings avatar Nov 28 '22 16:11 joecummings