Tacotron
Did you try to replace LSA in Tacotron2 (NVIDIA) with this DCA implementation?
Hi! Recently I've been trying to use your DCA implementation in Tacotron2 (NVIDIA version) on the LJSpeech dataset. The alignment fails most of the time, like the picture below. Do you have any idea what's going wrong?
Hi. I had similar problems when I tried to apply DCA to NVIDIA/tacotron2 (e.g. it could not learn the alignment and output NaN loss). Then, when I used beta=6.3 and grad_threshold=0.05 while keeping the stop prediction loss, the problems seemed to be solved. However, although the alignment becomes stable through training, I noticed that the training loss decreases much more slowly than the original Tacotron2, and the validation loss even increases. :( I've started another training run with noam_lr_scheduling, so I will share the result once the training has progressed far enough. If my comment helps you, I hope to hear about your experience later. :D
I met exactly the same problems (bad alignment and NaN loss) on two datasets, LJSpeech and Blizzard2013. I changed grad_threshold to 0.5 and 0.05, but neither helped. Here are some points of confusion and some thoughts: (1) Does changing beta from 0.9 to 6.3 mean accelerating the movement of the alignment at each decoder step? Is that what you intended? (2) I don't think grad_threshold is the key solution; your noam_lr_scheduling may be vital to DCA (because the original paper emphasizes it). I also plan to set the lr to different values at different iterations. Your comments are so helpful! Please let me know the noam_lr_scheduling result.
(1) A higher beta actually decelerates the movement of the alignment at each decoder step (the average step is nα/(α + β)). (2) When I applied noam_lr_scheduling, it seemed really helpful for the training.
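A quick sanity check of the "higher beta decelerates" claim, using `scipy.stats.betabinom` (a sketch; the filter length of 11, i.e. support 0..10, is taken from the DCA paper):

```python
from scipy.stats import betabinom

# The mean of Beta-Binomial(n, alpha, beta) is n * alpha / (alpha + beta),
# so raising beta shrinks the expected forward step per decoder step,
# i.e. it slows the alignment down rather than speeding it up.
n = 10  # prior filter covers steps 0..10 (length 11)
means = {
    (alpha, beta): betabinom(n, alpha, beta).mean()
    for (alpha, beta) in [(0.9, 0.1), (0.1, 0.9), (0.9, 6.3)]
}
for (alpha, beta), m in means.items():
    print(f"alpha={alpha}, beta={beta}: average forward step = {m:.2f}")
```

With beta=6.3 the average step drops to 10 * 0.9 / (0.9 + 6.3) = 1.25 encoder steps per decoder step.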
@LeoniusChen could you provide a snippet of how you implemented DCA in NVIDIA Tacotron 2?
Excuse me, could you share the DCA part in Taco2 for study? I have some problems modifying it. Help please, thanks deeply!
I just used the DCA implementation in this repo and replaced LSA with it. 😂 There is nothing special. The results of my implementation are not satisfying. You may need forward attention.
@LeoniusChen @LEEYOONHYUNG the problem isn't really about beta being too small and thus pushing the attention forward too fast. The most likely (verified) problem here is that during the paper writing the alpha and beta values were flipped by mistake (or I misunderstand the paper). In the context of this model, Figure 1 of the paper can only be reproduced with alpha=0.9, beta=0.1 and causal padding on the prior convolution.
The beta-binomial distribution is symmetric, meaning that its pmf for alpha=a and beta=b is the perfect reverse of its pmf for alpha=b and beta=a. The prior filter of Figure 1 in the paper also contains the mistake in the alpha and beta parameters. The correct prior filter is supposed to be the reverse (~0.74 at index 10, not 0).
The following picture shows the difference between the prior filter described in the paper (blue) and the actual prior filter that recreates the behavior of repeated application in Figure 1 (red).
Hope this helps.
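The symmetry claim above is easy to verify with `scipy.stats.betabinom` (a sketch, assuming the paper's filter length of 11, i.e. support 0..10):

```python
import numpy as np
from scipy.stats import betabinom

n = 10
k = np.arange(n + 1)
pmf_paper = betabinom(n, 0.1, 0.9).pmf(k)    # parameters as written in the paper
pmf_flipped = betabinom(n, 0.9, 0.1).pmf(k)  # alpha and beta swapped

# the two pmfs are exact mirror images of each other
assert np.allclose(pmf_paper, pmf_flipped[::-1])

# with alpha=0.9, beta=0.1 the ~0.74 mass sits at index 10, not index 0
print(round(float(pmf_flipped[n]), 2))  # -> 0.74
```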
Hi @Rayhane-mamah, I also noticed the decreasing trend of the beta-binomial distribution for alpha=0.1 and beta=0.9. However, the author of this repo has reversed the distribution array in the code below, which is equivalent to alpha=0.9 and beta=0.1. Did you set the reduction factor (in Tacotron) > 2 when you implemented DCA? I only set it to 1, and I think this may be critical. https://github.com/bshall/Tacotron/blob/6fee34a7c3a9d4ceb9215ed3063771a9287010e1/tacotron/model.py#L164
Ah, I missed the reverse part. Nice catch. No, I am using a simple reduction factor of 1 and the alignment is working well. I can imagine though that the optimal prior filter would change by dataset. In general, it is a bit weird to consider a prior filter that pushes alignments 1 step further for each decoder step. In TTS, that is usually not true, especially with a reduction factor of 1. In that case, maybe increasing beta as discussed earlier in this discussion is a good idea.
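For anyone trying to reproduce this: a hedged sketch (not the exact code of this repo or of NVIDIA/tacotron2) of applying the beta-binomial prior with causal padding. Since `F.conv1d` cross-correlates (i.e. it effectively reverses the filter relative to true convolution), using the alpha=0.9, beta=0.1 pmf directly as the weight drifts the alignment forward by roughly 1 encoder step per decoder step on average:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.stats import betabinom

n = 10
pmf = betabinom(n, 0.9, 0.1).pmf(np.arange(n + 1))  # mass concentrated near index 10
prior = torch.tensor(pmf, dtype=torch.float32).view(1, 1, -1)

alignment = torch.zeros(1, 1, 50)
alignment[0, 0, 0] = 1.0  # attention starts on the first encoder step

for _ in range(3):  # repeated application, as in Figure 1 of the paper
    padded = F.pad(alignment, (n, 0))  # causal (left-only) padding
    alignment = F.conv1d(padded, prior)

# expected attention position; drifts ~1 encoder step per application
expected_pos = (alignment[0, 0] * torch.arange(50.0)).sum().item()
print(round(expected_pos, 2))
```

With cross-correlation plus left-only padding, the effective step distribution is the reversed pmf (alpha=0.1, beta=0.9), whose mean is 10 * 0.1 / (0.1 + 0.9) = 1, so after three applications the expected position is about 3.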
@nicemanis @ccuiyuhan Now you can use the code in this repo to implement DCA or GMM attention.