
raising the predicted magnitudes by a power of 1.2, not input magnitudes

Open Durham opened this issue 8 years ago • 4 comments

I may be wrong about this, because I haven't had time to study it in detail, but the paper says: "raising the predicted magnitudes by a power of 1.2 before feeding to Griffin-Lim reduces artifacts, likely due to its harmonic enhancement effect". In the code, I see that the input (training-data) magnitudes are raised to the power of 1.2, but from the statement above it seems this should be done to the output values, just before spectrogram2wav. Nothing is said about raising the input magnitudes to the power of 1.2.

Durham avatar May 30 '17 08:05 Durham
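To make the point concrete, here is a minimal sketch (not the repo's actual code) of where the exponent would go under this reading of the paper: the power of 1.2 is applied to the model's predicted magnitudes immediately before Griffin-Lim, while the training targets are left untouched. librosa's `griffinlim` stands in for the repo's own Griffin-Lim implementation, and the parameter values are illustrative.

```python
import numpy as np
import librosa

def spectrogram2wav(mag, power=1.2, n_iter=50, hop_length=None):
    """Invert a predicted linear-magnitude spectrogram (freq x time) to audio.

    The power of 1.2 is applied here, to the *predicted* magnitudes,
    right before Griffin-Lim, as described in the paper.
    """
    mag = np.asarray(mag) ** power  # harmonic enhancement of the prediction
    return librosa.griffinlim(mag, n_iter=n_iter, hop_length=hop_length)
```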

Well, does that make any difference?

Kyubyong avatar Jun 02 '17 06:06 Kyubyong

As we have found, the output spectrogram should be raised to the power of 1.2; the voice then sounds clearer.

wenjunpku avatar Jun 03 '17 10:06 wenjunpku

Okay, let me give you an example. Correct me if I'm wrong.

Assume an element of the original magnitude has the value 2. The best model should output 2, of course. That value is then raised to the power of 1.2, giving about 2.3, before being converted into a waveform. In contrast, I raised the target value, so our model tries to learn about 2.3 directly.

Kyubyong avatar Jun 03 '17 12:06 Kyubyong
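A small numeric sketch of the two placements being contrasted above; the variable names are illustrative and not taken from the repo. For a perfect model the final value is identical either way; the difference is what the network is asked to regress and where the enhancement is applied.

```python
mag = 2.0  # one element of the ground-truth magnitude

# Paper's reading: train on the raw magnitude, raise the prediction at synthesis.
target_paper = mag                   # the model learns 2.0
output_paper = target_paper ** 1.2   # ~2.297, applied just before Griffin-Lim

# Original code: raise the training target itself.
target_code = mag ** 1.2             # the model tries to learn ~2.297 directly
output_code = target_code            # nothing extra is applied at synthesis time

print(round(output_paper, 3), round(output_code, 3))  # 2.297 2.297
```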

Actually, I've changed the relevant code.

See https://github.com/Kyubyong/tacotron/blob/master/utils.py#L42 and https://github.com/Kyubyong/tacotron/blob/master/eval.py#L65

I think this revision is closer to the paper. Thanks, guys!

Kyubyong avatar Jun 03 '17 15:06 Kyubyong