pytorch-seq2seq
Custom Text Dataset
I am trying to work with my own data in a txt file, where the source and target sentences are separated by a tab. The problem is that I'm not able to use Field, and this created many issues in the code for me. Can anyone help me use my data with Field?
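In case it helps while setting up Field, a tab-separated file of that shape can first be read with plain Python to check that every line really splits into a source/target pair (the file name `data.txt` below is just a placeholder):

```python
# Read a tab-separated parallel corpus: one "source<TAB>target" pair per line.
def read_parallel_corpus(path):
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            src, trg = line.split('\t', 1)  # split only on the first tab
            pairs.append((src, trg))
    return pairs
```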
I also want to ask this question!
If someone is looking for the answer, here's what worked for me:

```python
import torch
from torchtext.legacy.data import Field, TabularDataset, BucketIterator

tokenize = lambda x: x.split(' ')

SRC = Field(tokenize=tokenize)
TRG = Field(tokenize=tokenize)

# Keys must match the header row of the CSV files.
fields = {'Source': ('src', SRC), 'Target': ('trg', TRG)}

train_data, valid_data, test_data = TabularDataset.splits(
    path='',
    train='My_train_Set.csv',
    validation='My_Validation_Set.csv',
    test='My_test_set.csv',
    format='csv',
    fields=fields)

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

BATCH_SIZE = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device)
```
Thank you very much. I also want to know how to store source and target sentences in a CSV file, since they are paired sentences.
I don't know how your data is structured, but mine was originally in Excel files, so I didn't have any problems converting them to CSV.
Can you tell me how to create your own data in the CSV format?
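In case it helps, here is a rough sketch of building such a CSV with Python's standard `csv` module. The header row matches the keys of the `fields` dict above (`Source`/`Target`), which is what torchtext's dict-style `fields` argument uses to map columns; the function name and file path are just placeholders.

```python
import csv

# Write paired sentences into a CSV with 'Source' and 'Target' columns,
# matching the keys used in the torchtext fields dict.
def write_pairs_to_csv(pairs, path):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Source', 'Target'])  # header row
        writer.writerows(pairs)                # one (source, target) pair per row
```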
Thanks for this great solution. Using a model with a custom dataset is always a tedious and frustrating problem.