UniversalVocoding
Help needed. Trying to get vocoder working with output from an ML Tacotron
Hello,
I'm trying to figure out what I need to do so that my numpy array can be vocoded by the UniversalVocoder.
Attached is a sample npy file.
The output is from a modified version of https://github.com/Tomiinek/Multilingual_Text_to_Speech.
import os
import numpy

def main():
    import torch
    import soundfile as sf
    from univoc import Vocoder

    cwd: str = os.getcwd()
    # download pretrained weights (and optionally move to GPU)
    vocoder: Vocoder = Vocoder.from_pretrained(
        "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
    ).cuda()
    # load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
    mel = numpy.load(os.path.join(cwd, "tmp.npy"))
    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel)
    # save output
    sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)

if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 29, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 22, in main
    wav, sr = vocoder.generate(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 821, in forward
    max_batch_size = input.size(0) if self.batch_first else input.size(1)
TypeError: 'int' object is not callable
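Looking at the stack trace, generate() hands the input straight to a GRU, which calls input.size(...); a numpy array's .size is a plain int attribute rather than a method, so it seems the mel needs to be converted to a torch Tensor first.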
I've also tried the following, and now I'm getting "RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386":
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy"))
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
print(mel_tensor.shape)
torch.Size([1, 80, 386])
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 35, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 28, in main
    wav, sr = vocoder.generate(mel_tensor)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 835, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 229, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 205, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386
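So input_size here is the number of mel bins: the GRU expects the 80 mel channels in the last dimension, i.e. a (batch, frames, 80) tensor instead of the (1, 80, 386) I was passing.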
I finally figured out it needed a transpose, but now the generated wav is all silence?
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel_tensor)
# save output
sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
The following seems to work. Definitely different sounding...
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
top_db = 80
mel_npy = numpy.maximum(mel_npy, -top_db)
mel_npy = mel_npy / top_db
mel_tensor: Tensor = torch.FloatTensor(mel_npy).unsqueeze(0).to("cuda")
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel_tensor)
# save output
sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
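If I understand the vocoder's preprocessing right, it was trained on log-mel values clamped at -top_db and scaled into [-1, 0], so dividing by top_db puts my spectrogram into roughly the right range. I'm guessing the remaining difference in sound is because the spectrogram parameters themselves still don't match.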
Hi @michael-conrad,
Apart from the normalization steps, the parameters used to extract the mel-spectrogram need to be the same as the ones used in this repo. From a cursory glance at https://github.com/Tomiinek/Multilingual_Text_to_Speech, it looks like their model is trained on spectrograms from 22050Hz audio with a different hop-length and window-length from what I used here.
To fix this you have two options: 1. retrain the vocoder (with some minor modifications) using their spectrograms as the input; or 2. retrain the acoustic model at https://github.com/Tomiinek/Multilingual_Text_to_Speech to produce spectrograms with matching parameters.
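For reference, the extraction this vocoder expects looks something like the sketch below. The exact values (16 kHz audio, 2048-point FFT, 80 mel bins, 200/800-sample hop/window, top_db of 80) are my recollection of the defaults; double-check them against the preprocessing script in this repo before relying on them.

import librosa
import numpy as np

def extract_logmel(wav_path: str,
                   sr: int = 16000,        # assumed vocoder sample rate
                   n_fft: int = 2048,
                   n_mels: int = 80,
                   hop_length: int = 200,  # 12.5 ms at 16 kHz
                   win_length: int = 800,  # 50 ms at 16 kHz
                   top_db: float = 80.0) -> np.ndarray:
    # load and resample to the vocoder's training sample rate
    wav, _ = librosa.load(wav_path, sr=sr)
    # magnitude (power=1) mel spectrogram
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, n_mels=n_mels,
        hop_length=hop_length, win_length=win_length, power=1,
    )
    # convert to dB, clamp at -top_db, and scale into [-1, 0]
    logmel = librosa.amplitude_to_db(mel, top_db=None)
    logmel = np.maximum(logmel, -top_db)
    return logmel / top_db  # shape: (n_mels, frames)

The result is (80, frames), so it still needs the transpose and batch dimension from the snippets above before going into vocoder.generate().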
I'm prepping to run another test with a fork of it.
I'm looking in https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/params/params.py and trying to figure out what to change. I see there is a normalize setting, and I think the script https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/data/prepare_spectrograms.py handles that. Should I figure out how to make the normalization in this file match the vocoder?
"""
******************** PARAMETERS OF AUDIO ********************
"""
sample_rate = 22050 # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
num_fft = 1102 # number of frequency bins used during computation of spectrograms
num_mels = 80 # number of mel bins used during computation of mel spectrograms
num_mfcc = 13 # number of MFCCs, used just for MCD computation (during training)
stft_window_ms = 50 # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
stft_shift_ms = 12.5 # shift of the window (or better said gap between windows) in ms
griffin_lim_iters = 60 # used if vocoding using Griffin-Lim algorithm (synthesize.py), greater value does not make much sense
griffin_lim_power = 1.5 # power applied to spectrograms before using GL
normalize_spectrogram = True # if True, spectrograms are normalized before passing into the model, a per-channel normalization is used
# statistics (mean and variance) are computed from dataset at the start of training
use_preemphasis = True # if True, a preemphasis is applied to raw waveform before using them (spectrogram computation)
preemphasis = 0.97 # amount of preemphasis, used if use_preemphasis is True
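Converting those ms values to samples shows where the mismatch is (the vocoder-side numbers are the ones assumed in the sketch in the previous comment, so worth double-checking):

sample_rate = 22050
win_length = int(sample_rate * 50 / 1000)    # 1102 samples (matches num_fft above)
hop_length = int(sample_rate * 12.5 / 1000)  # 275 samples

# Assumed UniversalVocoding settings:
#   16000 Hz, win_length = 800 samples (also 50 ms), hop_length = 200 samples (also 12.5 ms)
# The durations match in ms, but the sample rate, frame sizes, and mel
# filterbank all differ, so the resulting spectrograms are not interchangeable.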