UniversalVocoding
Help needed. Trying to get vocoder working with output from an ML Tacotron
Hello,
I'm trying to figure out what I need to do so that my numpy array can be vocoded by the UniversalVocoder.
Attached is a sample npy file.
The output is from a modified version of https://github.com/Tomiinek/Multilingual_Text_to_Speech.
import os
import numpy

def main():
    import torch
    import soundfile as sf
    from univoc import Vocoder

    cwd: str = os.getcwd()
    # download pretrained weights (and optionally move to GPU)
    vocoder: Vocoder = Vocoder.from_pretrained(
        "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
    ).cuda()
    # load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
    mel = numpy.load(os.path.join(cwd, "tmp.npy"))
    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel)
    # save output
    sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)

if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 29, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 22, in main
    wav, sr = vocoder.generate(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 821, in forward
    max_batch_size = input.size(0) if self.batch_first else input.size(1)
TypeError: 'int' object is not callable
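Looking at the stack trace, generate() hands the input straight to a GRU, which calls input.size(...); a numpy array's .size is a plain int attribute rather than a method, so it seems the mel needs to be converted to a torch Tensor first.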
I've also tried the following, and now I'm getting "RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386":
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy"))
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
print(mel_tensor.shape)
torch.Size([1, 80, 386])
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 35, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 28, in main
    wav, sr = vocoder.generate(mel_tensor)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 835, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 229, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 205, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 80, got 386
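So input_size here is the number of mel bins: the GRU expects the 80 mel channels in the last dimension, i.e. a (batch, frames, 80) tensor instead of the (1, 80, 386) I was passing.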
I finally figured out it needed a transpose, but now the generated wav is all silence?
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
mel_npy = mel_npy.reshape((1, mel_npy.shape[0], mel_npy.shape[1]))
mel_tensor: Tensor = torch.tensor(mel_npy).to("cuda")
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel_tensor)
# save output
sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
The following seems to work. Definitely different sounding...
mel_npy: array = numpy.load(os.path.join(cwd, "tmp.npy")).transpose()
top_db = 80
mel_npy = numpy.maximum(mel_npy, -top_db)
mel_npy = mel_npy / top_db
mel_tensor: Tensor = torch.FloatTensor(mel_npy).unsqueeze(0).to("cuda")
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel_tensor)
# save output
sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)
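If I understand the vocoder's preprocessing right, it was trained on log-mel values clamped at -top_db and scaled into [-1, 0], so dividing by top_db puts my spectrogram into roughly the right range. I'm guessing the remaining difference in sound is because the spectrogram parameters themselves still don't match.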
Hi @michael-conrad,
Apart from the normalization steps, the parameters used to extract the mel-spectrogram need to be the same as the ones used in this repo. From a cursory glance at https://github.com/Tomiinek/Multilingual_Text_to_Speech, it looks like their model is trained on spectrograms from 22050Hz audio with a different hop-length and window-length from what I used here.
To fix this you have two options: 1. retrain the vocoder (with some minor modifications) using their spectrograms as the input; or 2. retrain the acoustic model at https://github.com/Tomiinek/Multilingual_Text_to_Speech to produce spectrograms with matching parameters.
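For reference, the extraction this vocoder expects looks something like the sketch below. The exact values (16 kHz audio, 2048-point FFT, 80 mel bins, 200/800-sample hop/window, top_db of 80) are my recollection of the defaults; double-check them against the preprocessing script in this repo before relying on them.

import librosa
import numpy as np

def extract_logmel(wav_path: str,
                   sr: int = 16000,        # assumed vocoder sample rate
                   n_fft: int = 2048,
                   n_mels: int = 80,
                   hop_length: int = 200,  # 12.5 ms at 16 kHz
                   win_length: int = 800,  # 50 ms at 16 kHz
                   top_db: float = 80.0) -> np.ndarray:
    # load and resample to the vocoder's training sample rate
    wav, _ = librosa.load(wav_path, sr=sr)
    # magnitude (power=1) mel spectrogram
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, n_mels=n_mels,
        hop_length=hop_length, win_length=win_length, power=1,
    )
    # convert to dB, clamp at -top_db, and scale into [-1, 0]
    logmel = librosa.amplitude_to_db(mel, top_db=None)
    logmel = np.maximum(logmel, -top_db)
    return logmel / top_db  # shape: (n_mels, frames)

The result is (80, frames), so it still needs the transpose and batch dimension from the snippets above before going into vocoder.generate().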
I'm prepping to run another test with a fork of it.
I'm looking in https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/params/params.py and trying to figure out what to change. I see there is a normalize setting, and I think the script https://github.com/CherokeeLanguage/Cherokee-TTS/blob/master/data/prepare_spectrograms.py handles that. Should I figure out how to make the normalization in this file match the vocoder?
"""
******************** PARAMETERS OF AUDIO ********************
"""
sample_rate = 22050 # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
num_fft = 1102 # number of frequency bins used during computation of spectrograms
num_mels = 80 # number of mel bins used during computation of mel spectrograms
num_mfcc = 13 # number of MFCCs, used just for MCD computation (during training)
stft_window_ms = 50 # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
stft_shift_ms = 12.5 # shift of the window (or better said gap between windows) in ms
griffin_lim_iters = 60 # used if vocoding using Griffin-Lim algorithm (synthesize.py), greater value does not make much sense
griffin_lim_power = 1.5 # power applied to spectrograms before using GL
normalize_spectrogram = True # if True, spectrograms are normalized before passing into the model, a per-channel normalization is used
# statistics (mean and variance) are computed from dataset at the start of training
use_preemphasis = True # if True, a preemphasis is applied to raw waveform before using them (spectrogram computation)
preemphasis = 0.97 # amount of preemphasis, used if use_preemphasis is True
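Converting those ms values to samples shows where the mismatch is (the vocoder-side numbers are the ones assumed in the sketch in the previous comment, so worth double-checking):

sample_rate = 22050
win_length = int(sample_rate * 50 / 1000)    # 1102 samples (matches num_fft above)
hop_length = int(sample_rate * 12.5 / 1000)  # 275 samples

# Assumed UniversalVocoding settings:
#   16000 Hz, win_length = 800 samples (also 50 ms), hop_length = 200 samples (also 12.5 ms)
# The durations match in ms, but the sample rate, frame sizes, and mel
# filterbank all differ, so the resulting spectrograms are not interchangeable.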