ParallelWaveGAN

Full-Band LPCNet

Open Freddy-pp opened this issue 2 years ago • 6 comments

@kan-bayashi , FYI. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9455356

Freddy-pp avatar Aug 10 '21 07:08 Freddy-pp

I know it. Dr. Toda is one of my bosses :)

kan-bayashi avatar Aug 10 '21 07:08 kan-bayashi

@Freddy-pp One question: what's so special about it? Stock MB-MelGAN and HiFi-GAN are capable of outputting good 44.1 kHz and 48 kHz audio with just a few hparam and preprocessing changes.

ZDisket avatar Aug 10 '21 08:08 ZDisket

Hello @ZDisket. I tested (with espnet+PWG and TensorFlowTTS) the pairs FS2+MB-MelGAN, FS2+MelGAN, FS2+FB-MelGAN and FS2+PWGAN on a single-speaker, 20-hour, studio-recorded male Russian dataset. FS2+PWGAN sounds ideal, but it is extremely slow in CPU inference. All the MelGAN variants have artifacts, even at 22 kHz, in places where two vowels follow each other (this does not look decoder-related, because PWG with the same FS2 model is perfect). At 8 kHz, FS2+MB-MelGAN sounds good, close to the recordings. With Tacotron2 I could try fine-tuning on GTA (ground-truth-aligned) mels, but with FS2 that is problematic.

Now I'm training HiFi-GAN on the same data at 22 kHz using the code from this repo, and at 150K iterations it shows the same behavior with FS2 models as the MelGANs do. Maybe it will get better with more training. In any case it is very slow in CPU inference: only 2-3 times faster than PWG, and 15-20 times slower than MBMG.

However, I was able to train FS2+MB-MelGAN at 22 kHz on an English female single-speaker dataset (not LJ), and it sounds very, very good, even better than FS2+PWG. My impression is that for some voices (especially female ones) MBMG can do very well for TTS, and for others it cannot. Even the article mentioned above says the following:

Although Multi-band MelGAN [22] can realize real-time synthesis with multiple CPU cores, it was not included in the experiments because the synthesis quality of Multi-band MelGAN was significantly worse than that of Parallel WaveGAN in preliminary experiments with a sampling frequency of 24 kHz. Although LVCNet [25] can realize real-time synthesis with a CPU for 24 kHz audio, it was also not included in the experiments because its synthesis quality was almost the same as that of Parallel WaveGAN [25].

At the same time, I have read articles and listened to synthesized audio for male Russian voices produced with Taco2 combined with an LPCNet modified for 22 kHz, and I liked the quality there very much.

Freddy-pp avatar Aug 10 '21 10:08 Freddy-pp

Small update. StyleMelGAN (1.5M iter) is much better than HiFi-GAN (1.5M iter) as a vocoder after FastSpeech2 for my dataset. FS2+StyleMelGAN has almost the same quality as FS2+PWG, but SMG is 3 times faster than PWG.

Freddy-pp avatar Sep 14 '21 06:09 Freddy-pp

> Small update. StyleMelGAN (1.5M iter) is much better than HiFi-GAN (1.5M iter) as vocoder after FastSpeech2 for my dataset. FS2+StyleMelGAN almost the same quality as FS2+PWG, but SMG 3 times faster than PWG.

Which is faster, StyleMelGAN or HiFi-GAN?

wizardk avatar Nov 02 '22 05:11 wizardk

> Small update. StyleMelGAN (1.5M iter) is much better than HiFi-GAN (1.5M iter) as vocoder after FastSpeech2 for my dataset. FS2+StyleMelGAN almost the same quality as FS2+PWG, but SMG 3 times faster than PWG.
>
> Which is faster, StyleMelGAN or HiFi-GAN?

They are all pretty fast.
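For anyone who wants numbers on their own hardware, one rough way to compare vocoders is the real-time factor (RTF): seconds of compute per second of generated audio, where lower is faster. The sketch below is only an illustration, not code from this repo. `real_time_factor` and `dummy_vocoder` are hypothetical names, and the hop size of 256 and sampling rate of 22050 Hz are assumed values; swap in your own model's inference call and settings.

```python
import time


def real_time_factor(synthesize, mel, hop_size, sample_rate, repeats=3):
    """Return compute seconds per second of generated audio (lower is faster).

    `synthesize` is any callable that takes a mel-spectrogram (a sequence of
    frames) and returns a waveform. The best of `repeats` runs is used to
    reduce noise from other processes.
    """
    # Each mel frame corresponds to `hop_size` output samples.
    audio_seconds = len(mel) * hop_size / sample_rate
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        synthesize(mel)
        best = min(best, time.perf_counter() - start)
    return best / audio_seconds


def dummy_vocoder(mel):
    # Hypothetical stand-in for a real vocoder: emits silence of the right length.
    return [0.0] * (len(mel) * 256)


# 200 frames of an 80-bin mel-spectrogram (assumed shape, for illustration only).
mel = [[0.0] * 80 for _ in range(200)]
rtf = real_time_factor(dummy_vocoder, mel, hop_size=256, sample_rate=22050)
print(f"RTF = {rtf:.4f}")
```

An RTF below 1.0 means the vocoder runs faster than real time on that machine. Running the same function with the same mel input for each model, on the same CPU and thread count, gives ratios comparable to the "N times faster" figures quoted earlier in this thread.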

shigabeev avatar May 28 '23 10:05 shigabeev