`torchaudio.transforms.InverseMelScale` does not work
🐛 Describe the bug
I am trying to reconstruct a waveform by composing the InverseMelScale and GriffinLim transforms. The operation hangs while running InverseMelScale, and I have had to kill the process for taking too long. I switched to the librosa equivalent, librosa.feature.inverse.mel_to_audio, which ran without any problems.
```python
import librosa
import torch
import torch.nn as nn
import torchaudio

SAMPLE_FILE = "samples/my_sample.mp3"

waveform, sample_rate = torchaudio.load(SAMPLE_FILE, num_frames=220_500, frame_offset=0)

waveform_to_mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80)
mel_scale_to_power = torchaudio.transforms.InverseMelScale(
    sample_rate=sample_rate, n_stft=1024, n_mels=80)
power_spec_to_waveform = torchaudio.transforms.GriffinLim(n_fft=1024, hop_length=256)
mel_spectrogram_to_waveform = nn.Sequential(mel_scale_to_power, power_spec_to_waveform)

mel_spectrogram = waveform_to_mel_spectrogram(waveform)

# ERROR: Hangs here!!!
reconstructed_waveform = mel_spectrogram_to_waveform(mel_spectrogram)

# Runs smoothly with no errors
reconstructed_waveform = librosa.feature.inverse.mel_to_audio(
    mel_spectrogram.numpy(), sr=sample_rate, n_fft=1024, hop_length=256)
```
I have also tried running the InverseMelScale transform without wrapping it in nn.Sequential, and the result is the same.
Versions
```
PyTorch version: 1.12.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: KDE neon User - 5.25 (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.23.2
Libc version: glibc-2.31

Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.6.5
[pip3] torch==1.12.0
[pip3] torchaudio==0.12.0
[pip3] torchinfo==1.7.0
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.13.0
[conda] blas 2.115 mkl conda-forge
[conda] blas-devel 3.9.0 15_linux64_mkl conda-forge
[conda] cpuonly 2.0 0 pytorch
[conda] libblas 3.9.0 15_linux64_mkl conda-forge
[conda] libcblas 3.9.0 15_linux64_mkl conda-forge
[conda] liblapack 3.9.0 15_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 15_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.22.4 py310h4ef5377_0 conda-forge
[conda] pytorch 1.12.0 py3.10_cpu_0 pytorch
[conda] pytorch-lightning 1.6.5 pyhd8ed1ab_0 conda-forge
[conda] pytorch-mutex 1.0 cpu pytorch
[conda] torchaudio 0.12.0 py310_cpu pytorch
[conda] torchinfo 1.7.0 pyhd8ed1ab_0 conda-forge
[conda] torchmetrics 0.9.3 pyhd8ed1ab_0 conda-forge
[conda] torchvision 0.13.0 py310_cpu pytorch
```
Hi @Kinyugo, sorry for my late response. I can reproduce the hanging issue.
I think the main cause is the max_iter parameter in InverseMelScale, which defaults to 100000: if the loss does not drop below tolerance_loss and the change in loss stays above tolerance_change, the optimization keeps running until it hits max_iter iterations.
https://github.com/pytorch/audio/blob/776cf0990b19dd568a845bbc1bf0c57f278512f5/torchaudio/transforms/_transforms.py#L492-L505
You can try reducing max_iter when constructing the module, though I'm not sure how the estimated waveform will sound; you can tune the parameters based on that.
Another thing is that n_stft in InverseMelScale needs to be changed to 513 (i.e. n_fft // 2 + 1), otherwise the output is not compatible with GriffinLim.
```python
mel_scale_to_power = torchaudio.transforms.InverseMelScale(
    sample_rate=sample_rate, n_stft=513, n_mels=80)
```
That could also be the cause of the slowness: with n_stft=1024 you increase the number of frequency bins in the output, which makes the spectrogram harder to estimate.
Hello ✋🏿 Thank you for your insights. I have made the changes and can confirm that it works. I see that torchaudio took quite a different approach to InverseMelScale. The librosa version runs quite fast and gives bearable results that are useful for debugging. Is it possible to implement the same in torchaudio? And do you have some pointers on how one could go about it?
librosa uses the non-negative least squares (NNLS) algorithm to estimate the spectrogram. There have been discussions about how to implement NNLS in torchaudio.
The main blocker is that the L-BFGS-B optimizer is not yet supported in PyTorch; that is the optimization algorithm behind librosa's NNLS.
Another workaround is using torch.linalg.lstsq to replace the current optimization. I have tested it locally and it passes the unit tests. The driver needs to be set to "gelsd" in order to improve the precision. To proceed with this solution, we need to run some benchmarks to make sure the estimated spectrogram stays close to the original and there is a runtime speedup.
Thank you for your time. I will stick with GriffinLim with a few iterations for now. I might also look into neural vocoding approaches. Feel free to reach out to me if that is something that might be of interest to the torchaudio community, I would love to contribute.
Hi @Kinyugo, that sounds good. I will propose the lstsq solution for InverseMelScale in a new issue and let you know the update on that.
Regarding the neural vocoding approach, torchaudio has WaveRNN neural vocoder and pretrained models. Feel free to try it or other neural vocoders and see which one has the best performance for you.