Did you use this repo to train a vocoder?

Open syang1993 opened this issue 7 years ago • 3 comments

@fatchord Hi, happy to see you again! I'm also working on FFTNet, but in my experiments I cannot get results similar to those on the paper's demo page, mainly regarding conditional sampling and post-denoising. Did you try to reproduce their results? Thanks.

syang1993 avatar Jul 20 '18 06:07 syang1993

@syang1993 Hi, how's it going? Yeah, I'm having similar problems - here's what my conditioned model sounds like after 300k steps (using 80-band mel-spectrograms):

300k_steps.wav.tar.gz

I haven't implemented the noise reduction, is that algorithm publicly available? I had a quick look around and couldn't find it.

As for conditional sampling - I was going to implement a simple threshold, or perhaps an exponential moving average over the summed values in the conditioning frames, and use that to differentiate between voiced and unvoiced states. But I haven't got around to it yet, so perhaps that's why it doesn't sound so good.
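Something along these lines is what I had in mind - a rough sketch only, where the EMA decay, the threshold ratio, and the function name are all mine, not from the paper:

```python
import numpy as np

def voiced_flags(mel, alpha=0.9, ratio=0.5):
    """Classify each conditioning frame as voiced/unvoiced.

    mel: (T, 80) mel-spectrogram (linear magnitudes assumed).
    A frame counts as 'voiced' when its summed energy exceeds
    `ratio` times an exponential moving average of frame energies.
    All thresholds here are illustrative guesses, not tuned values.
    """
    energy = mel.sum(axis=1)
    ema = energy[0]                     # seed the EMA with the first frame
    flags = np.zeros(len(energy), dtype=bool)
    for t, e in enumerate(energy):
        ema = alpha * ema + (1 - alpha) * e
        flags[t] = e > ratio * ema
    return flags
```

The idea would be to switch the sampling strategy per frame based on these flags.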

I'm curious what your implementation sounds like - any chance you could post a sample?

fatchord avatar Jul 20 '18 16:07 fatchord

@fatchord I also used 80-band mel-spectrograms to train my model. Since the authors only cite a book for the noise reduction, I don't know which specific method they used - maybe Wiener filtering?
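If it is something like Wiener filtering, the per-bin gain is simple to try - this is just my guess at a minimal STFT-magnitude version, not the authors' denoiser, and the noise estimate (e.g. averaged from silent frames) is assumed to come from elsewhere:

```python
import numpy as np

def wiener_denoise(stft_mag, noise_mag):
    """Apply a per-bin Wiener-style gain to STFT magnitudes.

    stft_mag:  (T, F) magnitude spectrogram of the noisy signal.
    noise_mag: (F,) magnitude estimate of the noise floor.
    Gain = max(S^2 - N^2, 0) / S^2, i.e. bins dominated by the
    noise estimate are attenuated toward zero.
    """
    power = stft_mag ** 2
    noise_power = noise_mag ** 2
    gain = np.maximum(power - noise_power, 0.0) / np.maximum(power, 1e-12)
    return gain * stft_mag
```

You would then resynthesize with the original phase via an inverse STFT.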

Since I'm on summer vacation, I can't send you my samples. But you can listen to generated-model.ckpt-200000.ema.pt.wav in https://github.com/syang1993/FFTNet/issues/2 - my results sound much the same as that (without conditional sampling and noise reduction). It contains strident audio in some places. When I tried random sampling rather than argmax, the generated speech got noisy.
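That noisy-vs-buzzy trade-off is exactly what conditional sampling is meant to fix: sharpen the output distribution in voiced frames instead of committing fully to argmax. A minimal sketch (if I remember right the paper scales the logits by c = 2 in voiced regions, but the function name and interface here are mine):

```python
import numpy as np

def conditional_sample(logits, voiced, c=2.0, rng=None):
    """Sample one quantized amplitude from the network's logits.

    Voiced frames: scale the logits by c before the softmax,
    sharpening the distribution toward its peak (argmax-like
    but still stochastic). Unvoiced frames: sample from the
    unmodified distribution to keep the noisy character.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = logits * c if voiced else logits
    p = np.exp(z - z.max())     # stable softmax
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

In a generation loop you would pick `voiced` per frame from the conditioning features.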

syang1993 avatar Jul 21 '18 17:07 syang1993

@fatchord this is not bad at all, although I know the goal is to replicate the paper's quality.

alirezag avatar Oct 19 '18 08:10 alirezag