FastSpeech2

Support for HiFi-GAN

loretoparisi opened this issue on Dec 23, 2020 · 10 comments

HiFi-GAN has state-of-the-art results for waveform generation from mel spectrograms.


Is it possible to add support for the HiFi-GAN model after the mel generation, in order to create the wav file?

    mel, mel_postnet, log_duration_output, f0_output, energy_output, _, _, mel_len = model(text, src_len)
    
    mel_torch = mel.transpose(1, 2)
    mel_postnet_torch = mel_postnet.transpose(1, 2)
    mel = mel[0].cpu().transpose(0, 1)
    mel_postnet = mel_postnet[0].cpu().transpose(0, 1)
    f0_output = f0_output[0].cpu().numpy()
    energy_output = energy_output[0].cpu().numpy()

    if not os.path.exists(hp.test_path):
        os.makedirs(hp.test_path)

    if melgan is not None:
        with torch.no_grad():
            wav = melgan.inference(mel_torch).cpu().numpy()  # use hifigan here?
            wav = wav.astype('int16')
            #ipd.display(ipd.Audio(wav, rate=hp.sampling_rate))
            # save audio file
            write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, wav)

or some additional adaptation would be needed?

For end-to-end inference with HiFi-GAN, the generation code would look like this:

    def inference(a):
        generator = Generator(h).to(device)

        state_dict_g = load_checkpoint(a.checkpoint_file, device)
        generator.load_state_dict(state_dict_g['generator'])
        generator.eval()
        generator.remove_weight_norm()
        with torch.no_grad():
            x = torch.FloatTensor(mel_torch).to(device)
            y_g_hat = generator(x)
            audio = y_g_hat.squeeze()
            audio = audio * MAX_WAV_VALUE
            audio = audio.cpu().numpy().astype('int16')
            write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, audio)

where mel_torch is our mel spectrogram.
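For completeness, the names h, Generator, load_checkpoint, and MAX_WAV_VALUE in the snippet above come from the official HiFi-GAN repository. A minimal setup sketch, assuming that repo's env.py, models.py, and meldataset.py are importable; the hifigan/config.json path is an assumption and should point at the config the checkpoint was trained with:

    import json
    import torch
    from env import AttrDict        # HiFi-GAN repo: env.py
    from models import Generator    # HiFi-GAN repo: models.py

    MAX_WAV_VALUE = 32768.0         # same constant as HiFi-GAN's meldataset.py

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load the generator hyperparameters the checkpoint was trained with.
    with open('hifigan/config.json') as f:   # path is an assumption
        h = AttrDict(json.load(f))

    def load_checkpoint(filepath, device):
        # Plain torch.load of the checkpoint dict, as in HiFi-GAN's inference.py.
        return torch.load(filepath, map_location=device)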

loretoparisi commented on Dec 23, 2020

Thanks for your suggestion. It is supported now and indeed the audio quality is much better!

ming024 commented on Feb 26, 2021

@ming024 Super, let me try it out. How can I choose it for the English voice? Thanks!

loretoparisi commented on Feb 26, 2021

Hi, thanks for your efforts in putting this amazing repo together! With your latest changes, I get

FileNotFoundError: [Errno 2] No such file or directory: 'hifigan/config.json'

when running synthesize.py. Would you mind adding the hifigan config as well?

chrr commented on Feb 26, 2021

@loretoparisi In my experience vocoders are generally independent of, or only weakly dependent on, the language, so feel free to try it.

@chrr I somehow forgot to upload the hifigan/ directory. It should be fixed now.

ming024 commented on Feb 26, 2021
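Regarding choosing the vocoder for the English voice: with the inference sketch above, the choice comes down to which pretrained generator checkpoint (and matching config.json) you load. A hedged sketch; the file names below are assumptions, so adjust them to the checkpoints you actually have:

    # Hypothetical checkpoint names; use whatever pretrained HiFi-GAN generators you downloaded.
    CHECKPOINTS = {
        'english_ljspeech': 'hifigan/generator_LJSpeech.pth.tar',   # single English speaker
        'universal': 'hifigan/generator_universal.pth.tar',         # multi-speaker / unseen speakers
    }

    checkpoint_file = CHECKPOINTS['english_ljspeech']
    # Then load it exactly as in the inference sketch above:
    #   state_dict_g = load_checkpoint(checkpoint_file, device)
    #   generator.load_state_dict(state_dict_g['generator'])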

Hey @ming024, I am working on Arabic, which has a different script than English. Will that affect the results? Also, should I use the universal HiFi-GAN model?

zaidalyafeai commented on Apr 14, 2021

@zaidalyafeai I believe the universal HiFi-GAN yields the best results for unknown speakers. I also think there may not be a large performance drop when using the pretrained vocoders on different languages, as long as the same preprocessing hyperparameters are used.

ming024 commented on Apr 15, 2021

Thanks @ming024, I tested both vocoders and indeed the universal one is much better. Which preprocessing hyperparameters affect the vocoders the most?

zaidalyafeai commented on Apr 15, 2021

@zaidalyafeai The preprocessing parameters should match those of the pretrained vocoders, or you may get strange results.

ming024 commented on May 26, 2021
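To make "should match" concrete, here is a hedged sanity-check sketch that compares FastSpeech2-style preprocessing settings against the vocoder's config.json. The FastSpeech2 values shown are typical LJSpeech-style settings, not taken from this thread, and the config path is an assumption; the key names on the vocoder side follow the official HiFi-GAN config format:

    import json

    # Typical LJSpeech-style preprocessing values for FastSpeech2 (illustrative, not definitive).
    fs2 = {
        'sampling_rate': 22050,
        'filter_length': 1024,    # n_fft
        'hop_length': 256,
        'win_length': 1024,
        'n_mel_channels': 80,
        'mel_fmin': 0,
        'mel_fmax': 8000,
    }

    with open('hifigan/config.json') as f:   # the vocoder checkpoint's own config
        voc = json.load(f)

    # (FastSpeech2 key, HiFi-GAN config key) pairs that must agree.
    pairs = [
        ('sampling_rate', 'sampling_rate'),
        ('filter_length', 'n_fft'),
        ('hop_length', 'hop_size'),
        ('win_length', 'win_size'),
        ('n_mel_channels', 'num_mels'),
        ('mel_fmin', 'fmin'),
        ('mel_fmax', 'fmax'),
    ]
    for fs2_key, voc_key in pairs:
        assert fs2[fs2_key] == voc[voc_key], f'{fs2_key}: {fs2[fs2_key]} != {voc[voc_key]}'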

@zaidalyafeai @ming024 Did you use the pre-trained universal vocoder that already exists, or did you train it from scratch when using it with, for example, Arabic data? I am now trying to add a pretrained VITS vocoder to FastSpeech2 (using the same preprocessing hyperparameters), but I only get noisy generated speech. Thanks in advance for your answer!

malradhi commented on Mar 20, 2022


No, you cannot do that. The standard HiFi-GAN vocoder is trained to map mel spectrograms to waveforms, so it can be used as a vocoder for FastSpeech2. The VITS decoder part (which has nearly the same structure as HiFi-GAN) is trained to generate waveforms from the VITS latent variable "z", not from mel spectrograms, so the two are completely different.

JohnHerry commented on Apr 12, 2023
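To make the distinction concrete, a minimal shape-level sketch; the mel dimensions are typical values and the 192-channel latent size is illustrative, not taken from any particular VITS configuration:

    import torch

    # What a HiFi-GAN vocoder is trained on: a mel spectrogram [batch, n_mel_channels, frames].
    mel = torch.randn(1, 80, 200)

    # What a VITS decoder is trained on: the VITS latent z (channel count is model-specific).
    z = torch.randn(1, 192, 200)

    # generator(mel)  -> speech, because the vocoder learned mel -> waveform.
    # vits_decoder(z) -> speech, because it learned z -> waveform.
    # Feeding a mel spectrogram to a VITS decoder only yields noise, even though the
    # two networks have nearly the same architecture.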