tacotron
raising the predicted magnitudes by a power of 1.2, not input magnitudes
I may be wrong about this because I haven't had time to study it in detail, but the paper says: "raising the predicted magnitudes by a power of 1.2 before feeding to Griffin-Lim reduces artifacts, likely due to its harmonic enhancement effect". In the code, I see that the input (training data) magnitudes are raised to the power of 1.2, but from the statement above it seems this should be applied to the output values, just before spectrogram2wav. Nothing is said about raising the input magnitudes to the power of 1.2.
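A minimal sketch of the inference-time placement described in the quote. The array here is purely illustrative, and the `spectrogram2wav` call is only shown in a comment for context; this is an assumption about placement, not the repo's exact code:

```python
import numpy as np

# Hypothetical predicted magnitude spectrogram (values illustrative).
predicted_mag = np.array([[0.5, 1.0, 2.0]])

# Per the paper's wording: sharpen the *predicted* magnitudes just
# before Griffin-Lim, rather than pre-sharpening the training targets.
sharpened = predicted_mag ** 1.2

# Magnitudes above 1 grow and magnitudes below 1 shrink, so spectral
# peaks (harmonics) are emphasized relative to low-energy noise.
# wav = spectrogram2wav(sharpened)  # repo function, shown for context
```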
Well, does that make any difference?
As we have found, the output spectrogram should be raised to a power of 1.2. Then the voice sounds clearer.
Okay, let me give you an example. Correct me if I'm wrong.
Assume the value of an element of the original magnitude is 2. The best model should output 2, of course. That value is then raised to the power of 1.2, giving about 2.30, before being converted into waveform. In contrast, I raised the target value, so our model tries to learn 2.30 directly.
Actually, I've changed the relevant codes.
See https://github.com/Kyubyong/tacotron/blob/master/utils.py#L42 and https://github.com/Kyubyong/tacotron/blob/master/eval.py#L65
I think this revision is closer to the paper. Thanks, guys!