
Generate audio from mag spectrogram

Open tunnermann opened this issue 5 years ago • 5 comments

Hey, thanks for your work on this project, it's really good.

I'm trying to use this vocoder to generate wavs from magnitude spectrograms I generated using another neural network. Using Griffin-Lim gets me decent audio, but it sounds kind of robotic, so I think your vocoder will improve it a lot.

The biggest difference between the parameters of the two networks is n_fft: my spectrograms use 1024 and your network uses 2048. So, if I try to use your pre-trained model changing only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.

I tried retraining the network changing only n_fft, but the results were not good; the output had a lot of noise.

Any leads on what I might try next?

tunnermann avatar Jul 13 '19 13:07 tunnermann

Hi @tunnermann, no problem.

I've just done a bit of testing. Passing a mel spectrogram with num_fft = 1024 to the pretrained model does result in some distortion of the audio. However, when I changed num_fft in the config.json and retrained the model from scratch I got fairly good results. Here are some samples: samples.zip.

Did you do anything else besides changing the one line in config.json?

Also, I'd be happy to share the weights for this model with you if you'd like?

bshall avatar Jul 14 '19 07:07 bshall

@bshall Thanks for your reply.

I did retrain the model with the new n_fft and got good results generating audio from wav files. Maybe my problem is in converting my spectrograms into mel spectrograms and feeding them to the network. I will investigate it further and also retrain the network directly on the generated spectrograms instead of spectrograms derived from the ground-truth audio.
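For reference, this is roughly the kind of conversion I have in mind (a quick sketch using librosa; the parameter values are just placeholders, and the log/normalization step would have to mirror the repo's own preprocessing exactly):

```python
import numpy as np
import librosa

# Placeholder parameters -- these need to match whatever the vocoder was
# trained with (see config.json); the values here are only examples.
sample_rate = 16000
n_fft = 1024
num_mels = 80
fmin = 40
min_level_db = -100

def linear_to_logmel(mag):
    """Convert a linear magnitude spectrogram of shape (n_fft // 2 + 1, frames)
    into a normalized log-mel spectrogram."""
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=n_fft,
                                    n_mels=num_mels, fmin=fmin)
    mel = np.dot(mel_basis, mag)
    # Log compression and [0, 1] normalization -- this has to follow the same
    # convention the vocoder saw during training, otherwise the inputs end up
    # on a different scale.
    logmel = 20 * np.log10(np.maximum(1e-5, mel))
    return np.clip((logmel - min_level_db) / -min_level_db, 0, 1)
```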

Thanks again.

tunnermann avatar Jul 15 '19 21:07 tunnermann

Yeah, that sounds like a reasonable approach. Let me know how it goes or if I can help at all. You can also try finetuning the model on the generated spectrograms. Might make experimenting a little faster.

bshall avatar Jul 16 '19 08:07 bshall

Hi @bshall @tunnermann, I ran into the same problem: I used different parameters to extract the mel spectrograms and retrained the model, but the loss gets stuck around 2.9 and the result has loud noise. What can I do to adjust the model to get better performance? Here are my config parameters and audio samples. I'm using several datasets covering multiple languages.

```json
"preprocessing": {
    "sample_rate": 16000,
    "num_fft": 1024,
    "num_mels": 80,
    "fmin": 40,
    "preemph": 0.97,
    "min_level_db": -100,
    "hop_length": 256,
    "win_length": 1024,
    "bits": 9,
    "num_evaluation_utterances": 10
},
"vocoder": {
    "conditioning_channels": 128,
    "embedding_dim": 256,
    "rnn_channels": 896,
    "fc_channels": 512,
    "learning_rate": 1e-4,
    "schedule": {
        "step_size": 20000,
        "gamma": 0.5
    },
    "batch_size": 256,
    "checkpoint_interval": 10000,
    "num_steps": 5000000,
    "sample_frames": 40,
    "audio_slice_frames": 8
}
```

audio_samples.zip

Approximetal avatar Apr 13 '20 08:04 Approximetal

Hi @Approximetal,

My guess is that a hop-length of 256 is too large for a sample rate of 16kHz. At this hop-length each frame is 16ms of audio. Most TTS and vocoder implementations that I've seen use either 12.5ms or 10ms. The ones that use a hop-length of 256 typically have audio at a sample rate of 22050 Hz.

The ZeroSpeech2019 dataset is only recorded at 16kHz so my default was a hop-length of 200 (12.5ms).
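Just to make the arithmetic concrete (a quick sanity check, nothing specific to this repo):

```python
# Frame duration in milliseconds = hop_length / sample_rate * 1000
sample_rate = 16000

for hop_length in (256, 200, 160):
    print(f"hop_length {hop_length}: {1000 * hop_length / sample_rate:.1f} ms per frame")

# hop_length 256: 16.0 ms per frame  (your config)
# hop_length 200: 12.5 ms per frame  (my default at 16 kHz)
# hop_length 160: 10.0 ms per frame
```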

Hope that helps!

bshall avatar Apr 14 '20 08:04 bshall