Transformer-TTS

Repeating words at the end of sentences

Open TakoYuxin opened this issue 5 years ago • 7 comments

When inputting long sentences, I found the model tended to repeat the ending words over and over again. I trained this model on the Blizzard 2011 challenge database, and both the transformer and the postnet were trained for over 500k iterations. The loss looked like it converged pretty well... Has anyone else come across this problem? Please give me some guidance on how to fix it.

TakoYuxin avatar Apr 24 '19 08:04 TakoYuxin

Did you add "stop token" prediction? I also met this problem.

sunnnnnnnny avatar May 10 '19 01:05 sunnnnnnnny

> Did you add "stop token" prediction? I also met this problem.

I did not add a stop token loss on the first try. However, even after adding stop token prediction, the model still tends to repeat ending words if the checkpoints are not carefully chosen. I also found that it took more time to converge after adding the stop token prediction. Did you try adding stop token prediction, and how did that work out?

Here is my code for calculating the stop token loss:

```python
# Target is 1 at padded frames (where pos_mel == 0) and 0 at real frames.
stop_tokens = t.abs(pos_mel.ne(0).type(t.float) - 1).cuda()
# Number of real (non-padded) frames per utterance.
pos_mask = t.sum(pos_mel.ne(0), 1)
# Positive-class weight of 7 on the first padded frame, 0 elsewhere.
pos_w_matrix = t.zeros(pos_mel.size())
for i in range(pos_w_matrix.size()[0]):
    pos_w_matrix[i, pos_mask[i]] = 7.
pos_w_matrix = pos_w_matrix.cuda()
stop_tokens_loss = nn.BCEWithLogitsLoss(pos_weight=pos_w_matrix)(stop_preds, stop_tokens)
```

I used a separate optimizer to train the stop token linear projection parameters.

TakoYuxin avatar May 10 '19 02:05 TakoYuxin
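One thing worth noting about the snippet above (this is my reading of PyTorch's `pos_weight` semantics, not something stated in the thread): `pos_weight` only scales the positive term of the BCE, so with `pos_w_matrix` equal to 7 at the first padded frame and 0 everywhere else, the positive targets on all later padded frames contribute nothing to the loss. In effect, only the first pad frame is trained as a "stop", and the remaining pad frames are already masked out.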

@TakoYuxin what's the batch size in your hyperparameters, and how many steps did it take to get intelligible speech?

WhiteFu avatar Jun 16 '19 13:06 WhiteFu

> @TakoYuxin what's the batch size in your hyperparameters, and how many steps did it take to get intelligible speech?

The batch size is 16, and it took about 100k steps to get intelligible speech without calculating the stop token loss. However, several words may be repeated or skipped in the resulting sentences.

TakoYuxin avatar Jun 17 '19 01:06 TakoYuxin

Thanks for your reply! Did you use multi-GPU training? I am training on a single GPU with 16 samples per step, and I can't get intelligible speech.

WhiteFu avatar Jun 17 '19 02:06 WhiteFu

> Thanks for your reply! Did you use multi-GPU training? I am training on a single GPU with 16 samples per step, and I can't get intelligible speech.

That's weird. I was also training on a single GPU. Did you change any other hparams?

TakoYuxin avatar Jun 17 '19 03:06 TakoYuxin

Stop tokens in the padding should be masked, right? The padded mels make predicting the pad stop tokens trivially easy. I think that's why it's not learning the stop token.

flip-arunkp avatar Aug 25 '19 10:08 flip-arunkp
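For what it's worth, here is a minimal sketch of one way to do that masking, following the tensor names from the loss snippet earlier in the thread (`stop_preds`, `pos_mel`). This is an illustrative assumption, not code from this repo: it places the single positive stop target on the last real frame and excludes padded frames from the loss entirely.

```python
import torch as t
import torch.nn.functional as F

def masked_stop_token_loss(stop_preds, pos_mel):
    """Hypothetical masked stop-token loss (sketch, not from this repo).

    stop_preds: (batch, time) logits from the stop-token projection.
    pos_mel:    (batch, time) positional indices, assumed 0 at padded frames.
    """
    # 1 for real frames, 0 for padded frames.
    valid_mask = pos_mel.ne(0).float()
    # Index of the last real frame in each utterance.
    last_idx = valid_mask.sum(dim=1, keepdim=True).long() - 1
    # Positive stop target only on the final real frame.
    stop_tokens = t.zeros_like(valid_mask)
    stop_tokens.scatter_(1, last_idx, 1.0)
    # Per-frame BCE, then zero out padded frames before averaging,
    # so the model gets no credit for trivial predictions in the pad.
    per_frame = F.binary_cross_entropy_with_logits(
        stop_preds, stop_tokens, reduction='none')
    return (per_frame * valid_mask).sum() / valid_mask.sum()
```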