
How to split `_RawTextIterableDataset`

Open KickItLikeShika opened this issue 3 years ago • 8 comments

❓ Questions and Help

I am trying to move off the legacy API and use the newly provided features. I was doing this:

import random
import torch
from torchtext import legacy

SEED = 1234  # any fixed seed

TEXT = legacy.data.Field(lower=True, batch_first=True)
LABEL = legacy.data.LabelField(dtype=torch.float)
train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.seed(SEED))

But now I want to split `train_data`. How can I do that?

from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))
# I need to split train_iter into train_iter and valid_iter

And I think providing more features beyond just this one would help. Thanks!

KickItLikeShika avatar Mar 30 '21 15:03 KickItLikeShika

It's an iterator, so I don't think you can split/shuffle it directly. I think it's worth adding an option to set an offset, i.e., the starting line. Then for the validation set, you can start from a different line. cc @cpuhrsch @parmeet
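The offset idea can be sketched with plain `itertools` (a toy stand-in, not the torchtext API; `make_iter`, the line data, and the offset value are all hypothetical):

```python
import itertools

def make_iter():
    # Toy stand-in for re-creating the raw text iterator from disk.
    return iter(f"line {i}" for i in range(100))

offset = 80  # hypothetical train/valid boundary

# Train reads lines [0, offset); valid starts from a different line.
train_iter = itertools.islice(make_iter(), 0, offset)
valid_iter = itertools.islice(make_iter(), offset, None)
```

Because each split re-creates the iterator, the two streams stay disjoint without caching the whole dataset.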

zhangguanheng66 avatar Mar 30 '21 15:03 zhangguanheng66

As a temporary solution, you can list all the items from the iterator and split them (if the dataset fits in memory).

train_data = list(train_iter)
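For example, with a toy stand-in for the IMDB iterator (the data here is hypothetical; with the real dataset, `train_iter` would come from `IMDB(split='train')`):

```python
import random

# Toy stand-in for the IMDB train iterator (hypothetical data).
train_iter = iter((1 if i % 2 else 0, f"review {i}") for i in range(100))

train_list = list(train_iter)      # cache everything in memory
random.seed(1234)                  # fixed seed for a reproducible split
random.shuffle(train_list)
cut = int(0.8 * len(train_list))   # 80/20 train/valid split
train_data, valid_data = train_list[:cut], train_list[cut:]
```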

zhangguanheng66 avatar Mar 30 '21 15:03 zhangguanheng66

Thanks! But I think if we did that, we could not get it back into a `_RawTextIterableDataset`, right?

KickItLikeShika avatar Mar 30 '21 16:03 KickItLikeShika

> Thanks! But I think if we did that, we could not get it back into a `_RawTextIterableDataset`, right?

I don't understand. Here, we just cache the iterator as a list. Search for `train_list = list(train_iter)` in this tutorial.
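If downstream code does expect an iterable dataset rather than a plain list, the cached list can be wrapped back up. A minimal sketch (this class is hypothetical; the real `_RawTextIterableDataset` lives inside torchtext):

```python
class ListIterableDataset:
    """Minimal iterable-dataset wrapper around a cached list
    (hypothetical stand-in for _RawTextIterableDataset)."""

    def __init__(self, data):
        self._data = data

    def __iter__(self):
        # Fresh iterator each time, so the dataset is re-iterable.
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```

Unlike the raw iterator, this wrapper can be iterated multiple times (e.g., once per epoch).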

zhangguanheng66 avatar Mar 30 '21 17:03 zhangguanheng66

@KickItLikeShika Thank you for bringing this up. We will discuss this request internally and get back to you shortly!

parmeet avatar Mar 31 '21 19:03 parmeet

We have migrated our datasets on top of torchdata datapipes. @ejguan I wonder if there is an elegant way to split a datapipe into two random, disjoint datapipes?

parmeet avatar Jun 23 '22 21:06 parmeet

It depends on how you want to split. For a simple case, you can use `demux` to split based on the indices generated by enumerating the prior DataPipe.
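The `demux` idea — route each enumerated item to one of two outputs — can be sketched in plain Python (`index_split` is a hypothetical helper, not part of torchdata; with datapipes you would instead call something like `dp.enumerate().demux(2, classifier_fn)`):

```python
import itertools

def index_split(iterable, valid_every=5):
    """Split an iterable into train/valid streams by index,
    mimicking demux over an enumerated pipe (hypothetical helper)."""
    a, b = itertools.tee(iterable)
    train = (x for i, x in enumerate(a) if i % valid_every != 0)
    valid = (x for i, x in enumerate(b) if i % valid_every == 0)
    return train, valid
```

With `valid_every=5`, every fifth item goes to validation and the rest to training, i.e., an 80/20 split without materializing the whole dataset.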

ejguan avatar Jun 24 '22 14:06 ejguan

Could you provide a code example? E.g., splitting the `train_iter` of IMDB into train and validation sets.

enigdata avatar Jul 30 '23 03:07 enigdata