FastSpeech2

Support for HiFi-GAN

loretoparisi opened this issue on Dec 23, 2020 · 10 comments

HiFi-GAN has state-of-the-art results for waveform generation from mel spectrograms.


Is it possible to add support for the HiFi-GAN model after the mel generation, in order to create the wav file?

    mel, mel_postnet, log_duration_output, f0_output, energy_output, _, _, mel_len = model(text, src_len)
    
    mel_torch = mel.transpose(1, 2)
    mel_postnet_torch = mel_postnet.transpose(1, 2)
    mel = mel[0].cpu().transpose(0, 1)
    mel_postnet = mel_postnet[0].cpu().transpose(0, 1)
    f0_output = f0_output[0].cpu().numpy()
    energy_output = energy_output[0].cpu().numpy()

    if not os.path.exists(hp.test_path):
        os.makedirs(hp.test_path)

    if melgan is not None:
        with torch.no_grad():
            wav = melgan.inference(mel_torch).cpu().numpy()  # use hifigan here?
            wav = wav.astype('int16')
            #ipd.display(ipd.Audio(wav, rate=hp.sampling_rate))
            # save audio file
            write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, wav)

or some additional adaptation would be needed?

For end-to-end inference with HiFi-GAN, the generation code would look like this:

    def inference(a):
        generator = Generator(h).to(device)

        state_dict_g = load_checkpoint(a.checkpoint_file, device)
        generator.load_state_dict(state_dict_g['generator'])
        generator.eval()
        generator.remove_weight_norm()
        with torch.no_grad():
            x = torch.FloatTensor(mel_torch).to(device)
            y_g_hat = generator(x)
            audio = y_g_hat.squeeze()
            audio = audio * MAX_WAV_VALUE
            audio = audio.cpu().numpy().astype('int16')
            write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, audio)

where mel_torch is our mel spectrogram.
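For completeness, the names h, Generator, load_checkpoint, and MAX_WAV_VALUE in the snippet above come from the official HiFi-GAN repository. A minimal setup sketch, assuming that repo's env.py, models.py, and meldataset.py are importable; the hifigan/config.json path is an assumption and should point at the config the checkpoint was trained with:

    import json
    import torch
    from env import AttrDict        # HiFi-GAN repo: env.py
    from models import Generator    # HiFi-GAN repo: models.py

    MAX_WAV_VALUE = 32768.0         # same constant as HiFi-GAN's meldataset.py

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load the generator hyperparameters the checkpoint was trained with.
    with open('hifigan/config.json') as f:   # path is an assumption
        h = AttrDict(json.load(f))

    def load_checkpoint(filepath, device):
        # Plain torch.load of the checkpoint dict, as in HiFi-GAN's inference.py.
        return torch.load(filepath, map_location=device)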

loretoparisi commented on Dec 23, 2020

Thanks for your suggestion. It is supported now and indeed the audio quality is much better!

ming024 commented on Feb 26, 2021

@ming024 Super, let me try it out. How can I choose it for the English voice? Thanks!

loretoparisi commented on Feb 26, 2021

Hi, thanks for your efforts in putting this amazing repo together! With your latest changes, I get

FileNotFoundError: [Errno 2] No such file or directory: 'hifigan/config.json'

when running synthesize.py. Would you mind adding the hifigan config as well?

chrr commented on Feb 26, 2021

@loretoparisi In my experience vocoders are generally independent of, or only weakly dependent on, the language, so feel free to try it.

@chrr I somehow forgot to upload the hifigan/ directory. It should be fixed now.

ming024 commented on Feb 26, 2021
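Regarding choosing the vocoder for the English voice: with the inference sketch above, the choice comes down to which pretrained generator checkpoint (and matching config.json) you load. A hedged sketch; the file names below are assumptions, so adjust them to the checkpoints you actually have:

    # Hypothetical checkpoint names; use whatever pretrained HiFi-GAN generators you downloaded.
    CHECKPOINTS = {
        'english_ljspeech': 'hifigan/generator_LJSpeech.pth.tar',   # single English speaker
        'universal': 'hifigan/generator_universal.pth.tar',         # multi-speaker / unseen speakers
    }

    checkpoint_file = CHECKPOINTS['english_ljspeech']
    # Then load it exactly as in the inference sketch above:
    #   state_dict_g = load_checkpoint(checkpoint_file, device)
    #   generator.load_state_dict(state_dict_g['generator'])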

Hey @ming024, I am working on Arabic, which has a different script than English. Will that affect the results? Also, should I use the universal HiFi-GAN model?

zaidalyafeai commented on Apr 14, 2021

@zaidalyafeai I believe the universal HiFi-GAN yields the best results for unknown speakers. I also think there may not be a large performance drop when using the pretrained vocoders on different languages, as long as the same preprocessing hyperparameters are used.

ming024 commented on Apr 15, 2021

Thanks @ming024, I tested both vocoders and indeed the universal one is much better. Which preprocessing hyperparameters affect the vocoders the most?

zaidalyafeai commented on Apr 15, 2021

@zaidalyafeai The preprocessing parameters should match those of the pretrained vocoders, or you may get strange results.

ming024 commented on May 26, 2021
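To make "should match" concrete, here is a hedged sanity-check sketch that compares FastSpeech2-style preprocessing settings against the vocoder's config.json. The FastSpeech2 values shown are typical LJSpeech-style settings, not taken from this thread, and the config path is an assumption; the key names on the vocoder side follow the official HiFi-GAN config format:

    import json

    # Typical LJSpeech-style preprocessing values for FastSpeech2 (illustrative, not definitive).
    fs2 = {
        'sampling_rate': 22050,
        'filter_length': 1024,    # n_fft
        'hop_length': 256,
        'win_length': 1024,
        'n_mel_channels': 80,
        'mel_fmin': 0,
        'mel_fmax': 8000,
    }

    with open('hifigan/config.json') as f:   # the vocoder checkpoint's own config
        voc = json.load(f)

    # (FastSpeech2 key, HiFi-GAN config key) pairs that must agree.
    pairs = [
        ('sampling_rate', 'sampling_rate'),
        ('filter_length', 'n_fft'),
        ('hop_length', 'hop_size'),
        ('win_length', 'win_size'),
        ('n_mel_channels', 'num_mels'),
        ('mel_fmin', 'fmin'),
        ('mel_fmax', 'fmax'),
    ]
    for fs2_key, voc_key in pairs:
        assert fs2[fs2_key] == voc[voc_key], f'{fs2_key}: {fs2[fs2_key]} != {voc[voc_key]}'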

@zaidalyafeai @ming024 Did you use the pre-trained universal vocoder that already exists, or did you train it from scratch when using it with, for example, Arabic data? I am now trying to add a pretrained VITS vocoder to FastSpeech2 (using the same preprocessing hyperparameters), but I only get noisy generated speech. Thanks in advance for your answer!

malradhi commented on Mar 20, 2022


No, you cannot do that. The standard HiFi-GAN vocoder is trained to map mel spectrograms to waveforms, so it can be used as a vocoder for FastSpeech2. The VITS decoder part (which has nearly the same structure as HiFi-GAN) is trained to generate waveforms from the VITS latent variable "z", not from mel spectrograms, so the two are completely different.

JohnHerry commented on Apr 12, 2023
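To make the distinction concrete, a minimal shape-level sketch; the mel dimensions are typical values and the 192-channel latent size is illustrative, not taken from any particular VITS configuration:

    import torch

    # What a HiFi-GAN vocoder is trained on: a mel spectrogram [batch, n_mel_channels, frames].
    mel = torch.randn(1, 80, 200)

    # What a VITS decoder is trained on: the VITS latent z (channel count is model-specific).
    z = torch.randn(1, 192, 200)

    # generator(mel)  -> speech, because the vocoder learned mel -> waveform.
    # vits_decoder(z) -> speech, because it learned z -> waveform.
    # Feeding a mel spectrogram to a VITS decoder only yields noise, even though the
    # two networks have nearly the same architecture.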