Tacotron-2
outputs_per_step
Hello,
I understand what outputs_per_step does, but should it always be a low value (i.e., around 2 or 5), or can it be around 50? How is it possible that you set the batch size in Tacotron to 32? The only way I can set it to 32 is if I set outputs_per_step to 50. Is that realistic? I have a GeForce 1080 with 11GB of RAM. Thanks.
Hello, you most probably have excessively long utterances in your training corpus.. Normally, with batch_size=32 and r=2, Tacotron only uses 8.8GB of VRAM. Are your samples something like 1 minute long? Also, are you using the following parameter to filter out very long utterances? https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/hparams.py#L18
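For reference, this is roughly what that length filter does at preprocessing time (a minimal sketch; the hyperparameter names `clip_mels_length` and `max_mel_frames` are assumptions based on the linked hparams.py, so double-check them against your checkout):

```python
# Sketch of a preprocessing-time length filter (hyperparameter names assumed).
# Utterances whose mel spectrograms exceed max_mel_frames are simply skipped,
# so oversized samples never blow up the memory of a training batch.

def keep_utterance(mel_frames, hparams):
    """Return True if this utterance is short enough to train on."""
    if hparams.clip_mels_length and mel_frames > hparams.max_mel_frames:
        return False
    return True

# Illustrative numbers: with max_mel_frames=900, hop_size=275 and sample_rate=22050,
# anything longer than ~11 seconds of audio would be dropped.
```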
I personally don't think r=50 is a good idea at all. Think of it as telling the model to make 50 consecutive predictions at the same time, "guessing" the current [1 to 49] outputs and using that guess to make the 50th prediction. Let me explain it better: we train the model on the conditional probability of predicting a mel frame y_t given all previous mel frames y_<t. When we use r=2, we're assuming that the model is capable of making 2 correct predictions without explicitly conditioning y_2 on y_1 within each decoder step; instead, the model relies on previous (older) steps for this prediction. If you use r=50, you're asking the model to look back way too far; 50 frames may hold multiple phonemes or even words, so no, I don't think it's possible..
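To make the trade-off concrete, here's a small illustrative sketch (not code from this repo) of what the reduction factor does to the decoder targets: each decoder step has to emit r frames at once, so a large r shrinks the number of decoder steps (and the memory footprint) but forces the model to guess many consecutive frames from the same context.

```python
import numpy as np

def group_frames(mel, r):
    """Group a [T, n_mels] mel spectrogram into [T // r, r * n_mels] decoder
    targets: each row is what one decoder step must predict when outputs_per_step = r."""
    T, n_mels = mel.shape
    T_trim = (T // r) * r              # drop (in practice, pad) the remainder
    return mel[:T_trim].reshape(T_trim // r, r * n_mels)

mel = np.random.randn(1000, 80)        # ~12.5 s of audio at 12.5 ms frames
print(group_frames(mel, 2).shape)      # (500, 160): 500 decoder steps, 2 frames each
print(group_frames(mel, 50).shape)     # (20, 4000): only 20 decoder steps, but each
                                       # must predict 50 frames (~0.6 s) in one shot
```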
You are right. Actually, my issue is that I have modified the inputs of the model, and the number of frames I have in the input is equal to the number I have in the output. I have audio files aligned with their phonemes, so technically there is no need to use seq2seq. However, if I don't use seq2seq, I obtain extremely over-smoothed spectrograms... any idea about this?
Also, would you consider this a good prediction? It was done without the help of ground truth. The upper one is the ground truth, and the lower one is the prediction! I feel there are several differences; I can't make up my mind.
Hey again @DLZML001,
Hmm, yeah, using seq2seq for already time-aligned features feels a little bit weird actually.. Usually seq2seq is used to capture some "duration" information when we can't (or don't want to) make one-to-one alignments between inputs and outputs.
I assume those are predictions from your non-seq2seq model? It kind of feels over-flattened, not to say over-smoothed (we usually refer to blurriness when we say "smoothed" around here). I presume it sounds too "machine-like" when listening to the samples? Usually, the more variation there is in frequency levels, the more natural the speech sounds..
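If you want a crude number to track that flatness (just an illustrative sketch, nothing from this repo), you can compare how much each mel bin moves over time in the prediction versus the ground truth:

```python
import numpy as np

def temporal_variation(mel):
    """Mean over mel bins of the standard deviation across time; lower = flatter.
    mel is expected as a [T, n_mels] array."""
    return float(np.std(mel, axis=0).mean())

# A ratio close to 1.0 means the prediction moves about as much as the truth;
# values well below 1.0 are a sign of over-smoothing / over-flattening.
# flatness_ratio = temporal_variation(predicted_mel) / temporal_variation(ground_truth_mel)
```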
Is there any particular reason that pushes you toward using time-aligned input-output pairs? Can't you just get phonetic representations (like words converted with CMU dictionaries) and let the seq2seq model do the job? It usually sounds more natural than the usual parametric approaches (which was also pointed out in the T2 paper).
Actually, my fear about using our T2 architecture with such inputs is that the encoder won't capture time dependencies correctly. In our model, the encoder operates on characters and uses convolutions to create N-grams. If the inputs are already tiled across time to match the output length, then those convolutions will pretty much lose some of their ability, resulting in the flattened speech. That's my opinion.
If you really insist on working with time-aligned features, what you can try is to enhance your mapping model with convolutions and recurrent cells (like our encoder) to map from inputs to mels directly (no need for attention or a decoder in your case, really). I am saying this while hoping that a stack of convolutions may enlarge the receptive field of the RNN so that it can capture the long-range time dependencies and with them reduce the flatness. Another thing you can try is a CBHG-like architecture (those are powerful at capturing context in sequences); we use one as the post-processing network that converts mels to linear spectrograms.
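To be clear about what I mean, here's a rough PyTorch-style sketch (PyTorch rather than this repo's TensorFlow, and all layer sizes are made up for illustration) of that kind of conv-stack + recurrent mapping model: frame-aligned phoneme features in, mel frames out, no attention and no autoregressive decoder. A CBHG stack could take the place of the plain conv stack if this still comes out flat.

```python
import torch
import torch.nn as nn

class AlignedFrameMapper(nn.Module):
    """Conv stack (widens the receptive field) + BiLSTM (long-range context)
    mapping frame-aligned phoneme features [B, T, in_dim] to mels [B, T, n_mels].
    Layer sizes are illustrative, not taken from the Tacotron-2 repo."""

    def __init__(self, in_dim, n_mels=80, channels=512, n_convs=3, kernel=5):
        super().__init__()
        convs = []
        for i in range(n_convs):
            convs += [
                nn.Conv1d(in_dim if i == 0 else channels, channels,
                          kernel_size=kernel, padding=kernel // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Dropout(0.5),
            ]
        self.convs = nn.Sequential(*convs)
        self.rnn = nn.LSTM(channels, channels // 2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(channels, n_mels)

    def forward(self, x):                   # x: [B, T, in_dim]
        h = self.convs(x.transpose(1, 2))   # convolve over time: [B, channels, T]
        h, _ = self.rnn(h.transpose(1, 2))  # [B, T, channels]
        return self.proj(h)                 # [B, T, n_mels]
```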
If those still don't make the speech natural, then I guess a seq2seq model with phonetic representations as inputs is your best shot (at least that way, your efforts on getting phonemes and making the alignments are only half lost instead of completely thrown away :) ).
That is obviously my personal opinion; someone might have better ideas. Let me know if I misunderstood anything or if I can assist you in this adventure ;)
EDIT: Actually, @DLZML001, if you want to have this conversation faster than through GitHub comments, you can join this Slack; we usually help each other out there to get the best Tacotron-based models we can, so others might help you with some ideas as well :)
Also, that way, if you want to share samples privately (due to data obligations or otherwise), you can do it easily ;)
First, thanks a lot for your detailed answer.
As for the mel spectrogram images, the predicted one was obtained with the Tacotron-2 seq2seq model. With the non-seq2seq models I obtain much more over-smoothed mel spectrograms.
When I use the seq2seq Tacotron-2 architecture with my inputs and outputs aligned, I do obtain the attention alignment perfectly well; it gives the perfect 'diagonal' curve aligning encoder and decoder. However, as shown in the image in my previous comment, it is still over-smoothed.
I have already tried a Tacotron-like architecture with the CBHG modules and without attention, and that is what is giving me extremely over-smoothed mel spectrograms.
I guess I am going to try a non-time-aligned phonetic representation, as it might be easier for the network to learn pronunciation than with words.
Thank you for the slack invitation!
Wait, if that is with Tacotron, then it's using r=50?? That would be impressive if it is! And that would explain much of the flatness too..
Well.. my advice went directly to the trash can hahaha.. I am a guy who believes it's time to start leaving most tasks to machines and letting them figure out how to do them, so I usually tend to feel better working with end-to-end systems with minimal human intervention (or preprocessing). So, my opinion? Yes, go 100% with the seq2seq approach with phonetic representations. Plus, Tacotron-2 is fairly stable compared to other seq2seq models, so that's a point in its favor :)
No no, I only tried r=50 briefly today, and I was almost sure it would not work. The results were obtained with r=2 and a batch size of 4 or 5. It was aligning, as I said, but the spectrograms were kind of over-smoothed.
I'll definitely try the seq2seq Taco2 with non-aligned phonetic features, and I'll let you know.
Thanks a lot for the help.
Hmm, yeah, alignment is easy when it's one-to-one, so a small batch size isn't really a problem. Makes sense.
Alright, I'll leave the issue open until I hear back from you. Let me know if you need anything ;)
A small batch size failed me on alignment. When I set r=1, I had to shrink the batch size to 24 for a single GPU, and it never got alignment during the whole training. That is why I need the multi-GPU version.
I am new to Tacotron-2. Please, I want to know: what is r?