Textless NLP / GSLM: Speech resynthesis produces silent .wav output
❓ Questions and Help
What is your question?
Hello, I'm currently following the tutorial and struggling with a problem where examples/textless_nlp/gslm/tools/resynthesize_speech.py produces a completely silent output file.
I don't think the problem occurs during the WaveGlow (vocoder) step, since the mel-spectrogram from Tacotron2 (the mel variable in examples/textless_nlp/gslm/unit2speech/utils.py) already shows no output. It also seems there is no problem with km.bin, since it produces sound files of different lengths depending on the input file length.
I was not sure whether I had a dependency or package issue (such as CUDA), so I reproduced these steps in several environments. However, both a fresh Anaconda environment (torch 1.12.1 + CUDA 11.3) and Google Colab (torch 1.12.1 + CUDA 10.1) showed the same result.
I'm attaching the input file, output file, and the resulting mel-spectrogram output below. Do you have any idea why this is happening?
Thanks a lot!
Code
- Downloaded pre-trained models from the repo (HuBERT-km200 in this example):
  - acoustic model
  - k-means model
  - TTS checkpoint model
  - code dict
  - vocoder (WaveGlow)
- Got a sample voice file (LibriSpeech in this example): 84-121123-0005.flac
- In resynthesize_speech.py, added code to plot the mel-spectrogram:
import matplotlib.pyplot as plt
import librosa

def plot_spectrogram(specgram, title=None, ylabel="freq_bin"):
    fig, axs = plt.subplots(1, 1)
    axs.set_title(title or "Spectrogram (db)")
    axs.set_ylabel(ylabel)
    axs.set_xlabel("frame")
    im = axs.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto")
    fig.colorbar(im, ax=axs)
    plt.savefig("figure01.jpg")

while True:
    # ~~~
    plot_spectrogram(mel[0].cpu().float().numpy(), title="MelSpectrogram - torchaudio", ylabel="mel freq")
    # ~~~
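(Note: since plot_spectrogram is called inside the loop and always saves to figure01.jpg, the file is overwritten on each iteration and ends up holding the spectrogram of the last processed input.)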
- Ran the following in bash:
export FAIRSEQ_ROOT=/home/ubuntu/fairseq
export DATA=/home/my/path/models
PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
--feature_type 'hubert' \
--layer 6 \
--acoustic_model_path $DATA/hubert_base_ls960.pt \
--kmeans_model_path $DATA/km.bin \
--tts_model_path $DATA/tts_checkpoint_best.pt \
--code_dict_path $DATA/code_dict.txt \
--waveglow_path $DATA/waveglow_256channels_new.pt \
--max_decoder_steps 2000
- Checked the mel-spectrogram: no signal is shown (the plot is empty); a quick tensor sanity check like the one sketched below confirms this.
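For reference, a minimal sanity check one could add next to the plotting call, assuming the mel tensor from Tacotron2 is in scope as in the snippet above (the helper name report_tensor is hypothetical):

import torch

def report_tensor(name, t):
    # Print shape, NaN presence, and value range: an all-NaN or all-zero
    # tensor renders as an empty spectrogram plot.
    t = t.detach().float().cpu()
    print(f"{name}: shape={tuple(t.shape)} "
          f"nan={bool(torch.isnan(t).any())} "
          f"min={t.min().item():.4f} max={t.max().item():.4f}")

report_tensor("mel", mel)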
What have you tried?
I also tried other acoustic models (CPC and wav2vec so far) and different k-means models.
I also (and originally) tried to produce a .wav file using gslm/unit2speech/synthesize_audio_from_units.py. It shows the same result: no sound output (silent).
What's your environment?
(main environment)
- fairseq Version (e.g., 1.0 or main): main (0.12.2)
- PyTorch Version (e.g., 1.0): 1.12.1
- OS (e.g., Linux): Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-125-generic x86_64)
- How you installed fairseq (pip, source):
git clone https://github.com/pytorch/fairseq
- Build command you used (if compiling from source):
pip install --editable ./
- Python version: 3.9.13
- CUDA/cuDNN version: Build cuda_11.3.r11.3/compiler.29745058_0
- GPU models and configuration: GeForce RTX 3090 (NVIDIA Corporation Device 2204 (rev a1))
- Any other relevant information: -
I have run into the same problem. Have you solved it?
Hi, I found that the WaveGlow model generates NaN tensors, which leads to the silent output. I fixed this issue by using fp32: you can try removing the .half() calls in the load_waveglow and load_tacotron functions. Hope this helps.
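For reference, a minimal sketch of the change, assuming the loaders end with a trailing .half() call (the simplified lines below are illustrative; the actual functions in examples/textless_nlp/gslm/unit2speech/utils.py load the checkpoints first):

# Inside load_tacotron / load_waveglow:
# before: fp16 inference, which can produce NaNs in WaveGlow
model = model.cuda().eval().half()
# after: keep the model in fp32
model = model.cuda().eval()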
That method works perfectly! Thanks a lot for solving my problem 👍👍