Kaizhi Qian

196 comments by Kaizhi Qian

The spectrogram should be between 0 and 1. Anyway, the fast vocoder is released. See README.
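One common way to bring a spectrogram into the 0 to 1 range is to convert magnitudes to decibels, then clip and rescale against a fixed floor. The following is a minimal sketch of that idea, not the repository's own preprocessing: the 100 dB dynamic range, the constants `FLOOR_DB` and `MIN_MAG`, and the function name `normalize_spectrogram` are all assumptions made for illustration.

```python
import math

# Hypothetical values chosen for illustration; the repository may use others.
FLOOR_DB = -100.0  # assumed quietest level that is kept
MIN_MAG = 1e-5     # avoids log10(0); 1e-5 corresponds to -100 dB


def normalize_spectrogram(magnitudes):
    """Map linear magnitudes (frames x bins) to values in [0, 1]."""
    normalized = []
    for frame in magnitudes:
        row = []
        for mag in frame:
            # Convert to decibels, flooring very small magnitudes.
            level_db = 20.0 * math.log10(max(mag, MIN_MAG))
            # Rescale so FLOOR_DB maps to 0 and 0 dB maps to 1, then clip.
            scaled = (level_db - FLOOR_DB) / -FLOOR_DB
            row.append(min(1.0, max(0.0, scaled)))
        normalized.append(row)
    return normalized
```

With these assumed constants, a magnitude of 1.0 maps to 1, anything at or below 1e-5 maps to 0, and louder values are clipped to 1.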

No, but this shouldn't matter.

I don't have access to that code. But it is very simple to implement by looking up the formulae in this paper http://www.seas.ucla.edu/spapl/paper/chu_icassp_09.pdf

The first dim is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.

The sampling rate should be 16 kHz instead of 48 kHz.
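Going from 48 kHz to 16 kHz is a decimation by a factor of 3: low-pass filter the signal, then keep every third sample. Below is a rough pure-Python sketch of that, assuming a Hamming-windowed sinc filter; the function names, the 63-tap length, and the cutoff are illustrative choices, not anything taken from the repository. A library resampler does the same job far faster on real audio.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff is in cycles per sample."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2.0 * cutoff
        else:
            sinc = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def downsample_by_3(samples, num_taps=63):
    """Low-pass filter, then keep every third sample (48 kHz -> 16 kHz)."""
    # Assumed cutoff: 0.15 cycles/sample = 7.2 kHz at 48 kHz,
    # slightly below the new Nyquist frequency of 8 kHz.
    taps = lowpass_taps(num_taps, cutoff=0.15)
    half = num_taps // 2
    out = []
    for start in range(0, len(samples), 3):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = start + k - half
            if 0 <= idx < len(samples):  # zero-pad at the edges
                acc += tap * samples[idx]
        out.append(acc)
    return out
```

One second of 48 kHz audio (48000 samples) comes out as 16000 samples.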

The length does not have to be 90. As long as the sampling frequency is correct, it should be fine.
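To see why the frame count depends on the sampling rate, here is a small sketch that estimates the number of spectrogram frames for a clip. It assumes a centered STFT with a hop of 256 samples; the hop length, the 1.44 s duration, and the function name `frames_for` are hypothetical values picked for illustration, not the repository's settings.

```python
def frames_for(duration_ms, sample_rate, hop_length=256):
    """Estimate STFT frames for a clip, assuming centered framing."""
    num_samples = sample_rate * duration_ms // 1000
    # A centered STFT yields one frame per hop, plus one.
    return 1 + num_samples // hop_length


# The same hypothetical 1.44 s clip at two sampling rates.
frames_16k = frames_for(1440, 16000)
frames_48k = frames_for(1440, 48000)
print(frames_16k, frames_48k)
```

Under these assumptions the clip gives 91 frames at 16 kHz but 271 at 48 kHz, so a frame count roughly three times larger than expected points to a sampling-rate mismatch rather than a problem with the clip length itself.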

No, and the downsampling procedure should not make a big difference.

OK. That explains it. I trimmed the silence off by hand.

The answer is 2. You will need µ and σ for inference. However, for an unseen speaker, you can normalize using that speaker's own µ and σ, which is not a bad...
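The normalization described here can be sketched as a per-speaker z-score. The comment does not say which feature is being normalized, so the sketch assumes a plain list of per-frame values (for example log-F0); the function names `speaker_stats` and `normalize` and the sample numbers are hypothetical.

```python
import statistics


def speaker_stats(values):
    """Return (mu, sigma) computed over one speaker's feature values."""
    return statistics.mean(values), statistics.pstdev(values)


def normalize(values, mu, sigma):
    """Z-score the values using the given speaker statistics."""
    if sigma == 0:
        raise ValueError("sigma is zero; cannot normalize a constant feature")
    return [(v - mu) / sigma for v in values]


# Unseen speaker: no stored statistics, so estimate them from the
# speaker's own data (made-up numbers for illustration).
unseen_values = [100.0, 120.0, 140.0]
mu, sigma = speaker_stats(unseen_values)
normalized = normalize(unseen_values, mu, sigma)
```

For a speaker seen in training, the stored µ and σ would be passed to `normalize` instead of being recomputed at inference time.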