
Seq2seq model tends to repeat

Open · TechSang opened this issue 2 years ago • 7 comments

You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.

Check List

Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

Environment

colab

Question

When I use the seq2seq model, I find that it tends to repeat some words and to generate sentences that are very different from the expected output (y_original in the example file). For example:

y_original: ['مەن كىم ؟', 'مەن كېسەل.', 'مەن سىزنى ياخشى كۆرمەن', 'ماڭا ياردەم كېرەك.', 'ئاغىرىشى مۇمكىن.', 'خەيىرلىك ئەتىگەن.']

model output: [['كىم', 'كىم'], ['كۆرمەن', 'كېسەل'], ['كۆرمەن', 'كېسەل'], ['كىم', 'كىم'], ['كىم', 'ياردەم'], ['كىم', 'كىم']]

TechSang avatar Jul 06 '21 06:07 TechSang

Wow. This is the first time I have seen someone use seq2seq for Uighur language generation. What kind of tokenization process did you use? Please also share your seq2seq definition code and how long you trained on the corpus.

BrikerMan avatar Jul 07 '21 01:07 BrikerMan

It just uses the example from the Kashgari Seq2seq tutorial. For the tokenizer I tried basic BERT and ERNIE, and the results are similar. I also tried seq2seq on a text2sql task and the same problem happens: it tends to repeat and to generate sentences that are irrelevant to train_y.

TechSang avatar Jul 07 '21 02:07 TechSang
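
For context, a tutorial-style Kashgari seq2seq run looks roughly like the sketch below. The import path, class name, and fit/predict calls are assumptions based on the Kashgari 2.x seq2seq tutorial and may differ between versions; the data is a toy placeholder.

# Minimal sketch of a tutorial-style seq2seq run (assumed API; verify against your Kashgari version).
from kashgari.tasks.seq2seq import Seq2Seq

# x_train / y_train are lists of token lists; they must be produced by a tokenizer
# whose output matches the embedding vocab (see the discussion further down).
x_train = [['hello', 'world'], ['how', 'are', 'you', '?']]
y_train = [['bonjour', 'monde'], ['comment', 'allez', 'vous', '?']]

model = Seq2Seq()
model.fit(x_train, y_train)
predictions = model.predict(x_train)  # list of predicted token sequences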

How many sentences did you use for training?

BrikerMan avatar Jul 07 '21 02:07 BrikerMan

About 30,000 sentences.

TechSang avatar Jul 07 '21 02:07 TechSang

It is most likely a tokenizer issue. You need to use a proper tokenizer so that your sentences' tokens match the embedding vocab.

For example, after you have loaded the embedding:

# let's assume the embedding vocab is
embedding.vocab2idx = {
    'A': 0,
    'B': 1,
    'C': 2,
    '[UNK]': 3,
}

sentence = 'A A B C D E F G H'.split()  # split the sentence into a word list on spaces
tokens = embedding._text_processor.transform([sentence])

# What is expected is something like [0, 0, 1, 2, 4, 5, 6, 7, 8],
# but what is actually returned is   [0, 0, 1, 2, 3, 3, 3, 3, 3]

So all the tokens that are not in the embedding's vocab are mapped to a predefined [UNK] token, which is what causes this issue.

BrikerMan avatar Jul 07 '21 02:07 BrikerMan
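
A quick way to confirm whether this is happening is to measure how much of the training corpus falls outside the embedding vocab before training. The sketch below relies only on the vocab2idx mapping shown above; unk_ratio is an illustrative helper, not part of Kashgari.

# Fraction of corpus tokens that are missing from the embedding vocab.
# A high ratio means most words are mapped to [UNK], so the decoder can only
# learn to emit a handful of frequent tokens, which shows up as repetition.
def unk_ratio(corpus, vocab2idx):
    total = sum(len(sentence) for sentence in corpus)
    unknown = sum(1 for sentence in corpus for word in sentence if word not in vocab2idx)
    return unknown / max(total, 1)

# Toy check with the vocab from the example above:
vocab2idx = {'A': 0, 'B': 1, 'C': 2, '[UNK]': 3}
corpus = ['A A B C D E F G H'.split()]
print(unk_ratio(corpus, vocab2idx))  # ~0.56, i.e. most tokens would become [UNK]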

Got it! I will try to fix it. Thanks for your help!!

TechSang avatar Jul 07 '21 02:07 TechSang

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 03:04 stale[bot]