Kashgari
Seq2seq model tends to repeat
You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.
Check List
Thanks for considering opening an issue. Before you submit it, please confirm the boxes below are checked.
You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.
- [x] I have searched in existing issues but did not find the same one.
- [x] I have read the documents
Environment
colab
Question
When I use the seq2seq model, I find it tends to repeat some words and generate sentences that differ greatly from the expected output (like y_original in the example file). e.g. y_original: [ 'مەن كىم ؟', 'مەن كېسەل.' ,'مەن سىزنى ياخشى كۆرمەن' , 'ماڭا ياردەم كېرەك.' , 'ئاغىرىشى مۇمكىن.', 'خەيىرلىك ئەتىگەن.'] model output: [['كىم', 'كىم'], ['كۆرمەن', 'كېسەل'], ['كۆرمەن', 'كېسەل'], ['كىم', 'كىم'], ['كىم', 'ياردەم'], ['كىم', 'كىم']]
Wow. This is the first time I have seen someone use seq2seq for Uighur language generation. What kind of tokenization process did you use? Please also share the seq2seq definition code and how long you trained on the corpus.
It just uses the example in the kashgari Seq2seq tutorial. For the tokenizer I tried basic BERT and ERNIE, and the results were similar. I also tried seq2seq on a text2sql task, and the same problem occurs: it tends to repeat and generate sentences that are irrelevant to train_y.
How many sentences did you use for training?
About 30,000 sentences.
It is most likely a tokenizer issue. You need to use a proper tokenizer so that your sentences' tokens match the embedding's vocab. For example, after you have loaded the embedding:
```python
# let's assume vocab2idx is
embedding.vocab2idx = {
    'A': 0,
    'B': 1,
    'C': 2,
    '[UNK]': 3,
}
sentence = 'A A B C D E F G H'.split()  # split sentence into a word list by space
tokens = embedding._text_processor.transform([sentence])
# What is expected is something like [0 0 1 2 4 5 6 7]
# but what actually returns is [0 0 1 2 3 3 3 3 3 3]
```
So all tokens that are not in the embedding's vocab are mapped to a predefined [UNK] token, which is what causes this issue.
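A minimal standalone sketch of the lookup behavior described above (the `transform` function here is a hypothetical stand-in for `embedding._text_processor.transform`, not Kashgari's actual implementation): every word missing from `vocab2idx` collapses to the `[UNK]` id, so the model sees many distinct inputs as identical and tends to produce repetitive output.

```python
# Hypothetical sketch of vocab lookup with an [UNK] fallback.
def transform(sentence, vocab2idx, unk_token='[UNK]'):
    """Map each word to its vocab id, falling back to the [UNK] id."""
    return [vocab2idx.get(word, vocab2idx[unk_token]) for word in sentence]

vocab2idx = {'A': 0, 'B': 1, 'C': 2, '[UNK]': 3}
sentence = 'A A B C D E F G H'.split()

ids = transform(sentence, vocab2idx)
oov_rate = ids.count(vocab2idx['[UNK]']) / len(ids)
print(ids)       # [0, 0, 1, 2, 3, 3, 3, 3, 3]
print(oov_rate)  # 5 of 9 tokens map to [UNK]
```

A quick check like this on a sample of your corpus shows how much of the training data degenerates into `[UNK]`; a high rate means the tokenizer and embedding vocab do not match.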
Got it! I will try to fix it. Thanks for your help!!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.