Kaizhi Qian
The spectrogram values should be between 0 and 1. In any case, the fast vocoder has been released; see the README.
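For anyone checking their own features against the 0-to-1 range, here is a minimal sketch of one common way to get there: rescale a dB-scale spectrogram and clip it. The function name `normalize_db_spectrogram` and the `floor_db=-100.0` value are assumptions made for illustration, not taken from the released code.

```python
import numpy as np

def normalize_db_spectrogram(spec_db, floor_db=-100.0):
    """Map a dB-scale spectrogram into [0, 1].

    floor_db is an assumed noise floor: values at or below it map to 0,
    and values at or above 0 dB map to 1.
    """
    scaled = (spec_db - floor_db) / (-floor_db)
    return np.clip(scaled, 0.0, 1.0)

# Hypothetical dB values, shaped (frames, bins).
spec_db = np.array([[-120.0, -100.0], [-50.0, 0.0], [10.0, -25.0]])
spec = normalize_db_spectrogram(spec_db)

# Quick sanity check before feeding the model or the vocoder.
assert spec.min() >= 0.0 and spec.max() <= 1.0
```

If this check fails on your own features, the scaling step is the first place to look.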
No, but this shouldn't matter.
I don't have access to that code, but it is very simple to implement by looking up the formulae in this paper: http://www.seas.ucla.edu/spapl/paper/chu_icassp_09.pdf
No, I didn't.
The first dimension is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.
The sampling rate should be 16 kHz, not 48 kHz.
The length does not have to be 90. As long as the sampling frequency is correct, it should be fine.
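To see why the sampling rate changes the frame count, here is a rough sketch of the arithmetic. The hop length of 256 samples, the centered-framing convention (`1 + n_samples // hop_length`), and the 1.5 s duration are all assumptions chosen for illustration; substitute the values from your own feature extraction.

```python
def expected_frames(duration_s, sample_rate, hop_length=256):
    """Approximate frame count for a centered STFT.

    hop_length=256 is an assumed value, not confirmed by the thread.
    """
    n_samples = int(round(duration_s * sample_rate))
    return 1 + n_samples // hop_length

# Hypothetical 1.5 s utterance.
print(expected_frames(1.5, 16000))  # 94
print(expected_frames(1.5, 48000))  # 282
```

With the same hop length, audio left at 48 kHz yields roughly three times as many frames as the same audio at 16 kHz.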
No, and the downsampling procedure should not make a big difference.
OK. That explains it. I trimmed the silence off by hand.
The answer is 2. You will need µ and σ for inference. However, for unseen speakers, you can normalize using their own µ and σ, which is not a bad...
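A minimal sketch of the two normalization cases, assuming plain z-score normalization. The helper name `zscore`, the `eps` guard, and the sample numbers are hypothetical; the thread does not say which feature is being normalized, so this works on a generic 1-D sequence.

```python
import numpy as np

def zscore(values, mean=None, std=None, eps=1e-8):
    """Normalize values with the given mean/std.

    If mean or std is omitted, it is computed from the values themselves,
    which corresponds to the unseen-speaker case.
    """
    values = np.asarray(values, dtype=np.float64)
    if mean is None:
        mean = values.mean()
    if std is None:
        std = values.std()
    return (values - mean) / (std + eps)

# Seen speaker: reuse the µ and σ stored from training (hypothetical numbers).
seen = zscore([110.0, 95.0, 102.0], mean=100.0, std=10.0)

# Unseen speaker: fall back to the utterance's own statistics.
unseen = zscore([210.0, 190.0, 205.0])
```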