How to split `_RawTextIterableDataset`
❓ Questions and Help
I am trying to move away from the legacy API and use the newly provided features. I was doing this:
```python
import random

import torch
from torchtext import legacy

TEXT = legacy.data.Field(lower=True, batch_first=True)
LABEL = legacy.data.LabelField(dtype=torch.float)

train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.seed(SEED))
```
But now I want to split train_data with the new API. How can I do that?
```python
from torchtext.datasets import IMDB

train_iter, test_iter = IMDB(split=('train', 'test'))
# I need to split train_iter into train_iter and valid_iter
```
And I think providing more features beyond just this one would help. Thanks!
It's an iterator, so I don't think you can split/shuffle it. I think it's worth adding an option to set the offset, or the beginning line, so the valid set can start from a different line. cc @cpuhrsch @parmeet
As a temporary solution, you can list all the items from the iterator and split them (if the dataset fits in memory):

```python
train_data = list(train_iter)
```
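A minimal sketch of that workaround: materialize the iterator into a list, shuffle, and slice. The sample data below is hypothetical, standing in for the `(label, text)` pairs that `list(train_iter)` would produce:

```python
import random

# Stand-in for list(train_iter); assumes the dataset fits in memory.
train_list = [(0, f"review {i}") for i in range(100)]  # hypothetical (label, text) pairs

random.seed(0)                      # fix the shuffle for reproducibility
random.shuffle(train_list)

split_idx = int(0.8 * len(train_list))  # 80/20 train/valid split
train_split = train_list[:split_idx]
valid_split = train_list[split_idx:]
```

The two slices are disjoint plain lists, so they can be fed to a DataLoader directly, but they are no longer `_RawTextIterableDataset` objects.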
Thanks! But I think if we did that, we could not get it back to a `_RawTextIterableDataset`, right?
I don't understand. Here, we just cache the iterator. Search for `train_list = list(train_iter)` in this tutorial.
@KickItLikeShika Thank you for bringing this up. We will discuss this request internally and get back to you shortly!
We have migrated our datasets on top of torchdata datapipes. @ejguan I wonder if there is an elegant way to split a datapipe into two random disjoint datapipes?
It depends on how you want to split. For a simple case, you can use `demux` to split based on the indices generated by enumerating over the prior DataPipe.
Could you provide a code example, e.g. splitting the train_iter of IMDB into train and validation?
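A minimal sketch of the `demux` approach suggested above. It uses an `IterableWrapper` over hypothetical samples in place of the real IMDB pipe; the classifier routes elements by their enumeration index, and a `map` strips the helper index afterwards:

```python
from torch.utils.data.datapipes.iter import IterableWrapper

# Stand-in for the IMDB train split (hypothetical samples).
samples = [f"review {i}" for i in range(10)]

# Attach an index to each element so the classifier can split deterministically.
indexed = IterableWrapper(list(enumerate(samples)))

# Route every 5th element to the validation pipe (~80/20 split).
train_pipe, valid_pipe = indexed.demux(
    num_instances=2,
    classifier_fn=lambda row: 1 if row[0] % 5 == 4 else 0,
)

# Drop the helper index again so the pipes yield the raw samples.
train_pipe = train_pipe.map(lambda row: row[1])
valid_pipe = valid_pipe.map(lambda row: row[1])

train_examples = list(train_pipe)
valid_examples = list(valid_pipe)
```

Note this gives a deterministic interleaved split, not a random one; for a random split you would shuffle (or hash) before classifying.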