pytorch-seq2seq

Custom Text Dataset

Open moodhiaj opened this issue 2 years ago • 6 comments

I am trying to use my own data, stored in a txt file where the source and target sentences are separated by a tab. The problem is that I'm not able to load it with Field, and this has caused many issues in the rest of the code. Can anyone help me with how to use my data with Field?

moodhiaj avatar Mar 26 '22 20:03 moodhiaj

I also want to ask this question!

wusuhuang avatar Mar 29 '22 02:03 wusuhuang

If someone is looking for the answer, here is what I did, and it worked for me:

```python
import torch
# torchtext.legacy is available in torchtext 0.9 to 0.11
from torchtext.legacy.data import Field, TabularDataset, BucketIterator

# Simple whitespace tokenizer
tokenize = lambda x: x.split(' ')

SRC = Field(tokenize=tokenize)
TRG = Field(tokenize=tokenize)

# Map the CSV column headers to (attribute name, Field)
fields = {'Source': ('src', SRC), 'Target': ('trg', TRG)}

train_data, valid_data, test_data = TabularDataset.splits(
    path='',
    train='My_train_Set.csv',
    validation='My_Validation_Set.csv',
    test='My_test_set.csv',
    format='csv',
    fields=fields)

# Build the vocabularies from the training set only
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Batch examples of similar source length together
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device)
```
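For anyone wondering what the whitespace tokenizer and `min_freq=2` do to the data, here is a rough pure-Python sketch. It is not torchtext's actual implementation: `build_vocab_sketch` is a hypothetical helper, the three sentences are made up, and I am assuming the default `<unk>` and `<pad>` special tokens come first in the vocabulary.

```python
from collections import Counter


def tokenize(sentence):
    # Same whitespace tokenizer as in the snippet above
    return sentence.split(' ')


def build_vocab_sketch(sentences, min_freq=2, specials=('<unk>', '<pad>')):
    """Hypothetical stand-in for build_vocab: count tokens, drop rare ones."""
    counter = Counter(tok for s in sentences for tok in tokenize(s))
    # Keep tokens seen at least min_freq times,
    # most frequent first, ties broken alphabetically
    kept = sorted((t for t, c in counter.items() if c >= min_freq),
                  key=lambda t: (-counter[t], t))
    itos = list(specials) + kept                   # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi


# Made-up sentences, for illustration only
sentences = ["the cat sat", "the dog sat", "a bird flew"]
itos, stoi = build_vocab_sketch(sentences)
print(itos)

# Tokens below min_freq are not in the vocabulary and map to <unk>
ids = [stoi.get(tok, stoi['<unk>']) for tok in tokenize("the cat flew")]
print(ids)
```

So with a small training set, `min_freq=2` can push a lot of words to `<unk>`; lowering it to 1 keeps every token.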

moodhiaj avatar Mar 29 '22 19:03 moodhiaj

Thank you very much. I also want to know how to store the source and target sentences in a CSV file. They are paired sentences.

(Sent by email in reply to Moodhi's comment of Wed, Mar 30, 2022, 03:57 AM; subject: Re: [bentrevett/pytorch-seq2seq] Custom Text Dataset (Issue #183))

wusuhuang avatar Mar 30 '22 00:03 wusuhuang

I don't know how your data is structured, but mine was originally in Excel files, so I didn't have any problems converting them to CSV.
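If the pairs are in a tab-separated text file instead (as in the original question), the conversion could be sketched with Python's `csv` module. This is only a sketch under a few assumptions: `tsv_to_csv` and the file name `pairs.txt` are made up for illustration, each line is expected to hold exactly one tab, and lines that don't fit are simply skipped. The `Source`/`Target` header row matches the keys of the `fields` dictionary in my snippet.

```python
import csv
import os
import tempfile


def tsv_to_csv(txt_path, csv_path, header=('Source', 'Target')):
    """Convert 'source<TAB>target' lines into a two-column CSV with a header."""
    written = 0
    with open(txt_path, encoding='utf-8') as fin, \
         open(csv_path, 'w', encoding='utf-8', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerow(header)
        for line in fin:
            parts = line.rstrip('\n').split('\t')
            if len(parts) != 2:
                continue  # skip blank or malformed lines (assumed policy)
            writer.writerow(parts)  # csv.writer quotes commas when needed
            written += 1
    return written


# Small demo with made-up sentence pairs in a temporary directory
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, 'pairs.txt')
csv_path = os.path.join(tmp_dir, 'My_train_Set.csv')
with open(txt_path, 'w', encoding='utf-8') as f:
    f.write("hello world\tbonjour le monde\n")
    f.write("thanks, friend\tmerci, mon ami\n")

n = tsv_to_csv(txt_path, csv_path)

# Read the CSV back to check the result
with open(csv_path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(n, rows)
```

The same function can be run once per split (train, validation, test) to produce the three files that `TabularDataset.splits` reads.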

moodhiaj avatar Mar 30 '22 10:03 moodhiaj

Can you tell me how to create your own data in CSV format?

wusuhuang avatar Apr 01 '22 01:04 wusuhuang

Thanks for this great solution. Using a model with a custom dataset is always a tedious and irritating problem.

tuzeao avatar Jun 30 '22 03:06 tuzeao