Tacotron attention when generating TTS blows up after generating the sentence

Open 24karatt opened this issue 6 years ago • 0 comments

I have a Tacotron model at 320k steps that I'm training on a custom dataset. It sort of generates sentences, but very often the attention for a sentence looks like this: __input_Hello ther_griffinlim_316k

Listening to the audio, you can tell he says the sentence, but afterwards it starts generating a lot of noise and blabbering speech. What could possibly be causing this?

Here's a picture of the attention at 320k steps: 316992 (I can see a lot of sparse attention, which I thought further training would purge, but it didn't really do much.)

I noticed that it failed to form a fully diagonal line, and I'm not sure what the problem with my dataset is. It consists of 16-bit, 22050 Hz mono WAV files, just like LJSpeech. Is 1 hour of data insufficient, or should I look for silence in the files? I'm also having trouble testing different parameters, because very often the model fails to develop the initial attention after 10k steps, though I saw in #77 that this is a current limitation. So far every run came out like this: 9968, except the 320k-step one, which came out (mostly) correctly.
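For the dataset questions above (file format and silence), a check along these lines might help narrow things down. This is only a minimal sketch using Python's standard `wave` and `array` modules. The function name `inspect_wav` and the amplitude threshold of 500 are illustrative assumptions and do not come from any Tacotron repository.

```python
import array
import sys
import wave


def inspect_wav(path_or_file, silence_threshold=500):
    """Report the format of a PCM WAV file and its leading/trailing silence.

    silence_threshold is a hypothetical absolute amplitude (16-bit scale)
    below which a sample is treated as silence; tune it for your recordings.
    """
    with wave.open(path_or_file, "rb") as wf:
        channels = wf.getnchannels()
        sample_width = wf.getsampwidth()  # bytes per sample; 2 means 16-bit
        rate = wf.getframerate()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)

    info = {
        "channels": channels,
        "bits": sample_width * 8,
        "rate": rate,
        "duration": n_frames / rate,
        "lead_silence": None,
        "trail_silence": None,
    }

    # Silence is only measured for 16-bit mono, the format described above.
    if channels != 1 or sample_width != 2:
        return info

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    loud = [i for i, s in enumerate(samples) if abs(s) > silence_threshold]
    if not loud:
        # The whole file is below the threshold.
        info["lead_silence"] = info["trail_silence"] = info["duration"]
    else:
        info["lead_silence"] = loud[0] / rate
        info["trail_silence"] = (len(samples) - 1 - loud[-1]) / rate
    return info
```

Running it over every file in the dataset and summing `duration` gives the actual amount of audio, and unusually large `lead_silence` or `trail_silence` values point to clips that may be worth trimming before training.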

Oct 01 '19 14:10 24karatt