Add `LengthSetterIterDataPipe` to all torchtext datasets
🚀 Feature
We want to append `LengthSetterIterDataPipe` to the end of every torchtext dataset. This will let users call `len()` on the datapipe object and avoid errors such as `TypeError: DataPipe instance doesn't have valid length`.
Motivation
See https://github.com/pytorch/tutorials/pull/1954#discussion_r993951194 for discussion.
Additional Context
Once this has been done for the Multi30k dataset, we can remove the conversion of the datapipe to a list in https://github.com/pytorch/tutorials/pull/1954 (i.e. `list(train_dataloader)`). That conversion materializes every sample in the dataset, which can lead to OOMs for very large datasets.
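To make the request concrete, here is a minimal, dependency-free sketch of the idea. It is not the torchdata implementation: `LazyPairs`, `LengthSetterSketch`, and `NUM_LINES` are hypothetical names used only for illustration, and the sample count is assumed to be known ahead of time. The point is that a thin wrapper can report a length without materializing any data.

```python
import math


class LazyPairs:
    """Hypothetical lazy dataset: yields samples one by one and defines no __len__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield (f"src {i}", f"tgt {i}")


class LengthSetterSketch:
    """Hypothetical wrapper that attaches a caller-supplied length to an iterable."""

    def __init__(self, source, length):
        if length < 0:
            raise ValueError("length must be non-negative")
        self.source = source
        self.length = length

    def __iter__(self):
        # Delegate lazily; no sample is loaded until it is requested.
        yield from self.source

    def __len__(self):
        # The length is trusted as given; it is not verified against the data.
        return self.length


NUM_LINES = 5  # assumed to be known in advance, e.g. from dataset metadata

raw = LazyPairs(NUM_LINES)
try:
    len(raw)
    raw_has_len = True
except TypeError:
    # Same failure mode as the error quoted in this issue.
    raw_has_len = False

pipe = LengthSetterSketch(raw, NUM_LINES)

# The batch count can now be computed without list(...) on the data.
batch_size = 2
num_batches = math.ceil(len(pipe) / batch_size)
```

With a wrapper like this in place, code that only needs the dataset size (for example, to compute the number of batches per epoch) no longer has to convert the pipeline to a list first.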
Hello! We are a group of second-year engineering students, and we would like to resolve this issue as a school project. Please let us know if we have your permission to contribute.
@moDallel - We welcome contributions! Thanks for your interest in the project.