Textless NLP / GSLM: Speech resynthesis produces silent .wav output

Open nonmetal opened this issue 2 years ago • 4 comments

❓ Questions and Help

What is your question?

Hello, I'm currently following the tutorial, and I'm running into a problem where examples/textless_nlp/gslm/tools/resynthesize_speech.py produces a completely silent output file.

I don't think the problem happens in the WaveGlow (vocoder) step, since the mel-spectrogram from Tacotron2 (the mel variable in /examples/textless_nlp/gslm/unit2speech/utils.py) already shows no output. The k-means model (km.bin) also seems fine, since the length of the output file varies with the length of the input file.

I wasn't sure whether I had a dependency or package issue (e.g., CUDA), so I reproduced these steps in several environments. However, both a fresh Anaconda environment (torch 1.12.1 + CUDA 11.3) and Google Colab (torch 1.12.1 + CUDA 10.1) showed the same result.

I'm attaching the input file, the output file, and the resulting mel-spectrogram below. Do you have any idea why this is happening?

Thanks a lot!

Code

  1. Downloaded pre-trained models from repo (HuBERT-km200 in this example)

    • acoustic model
    • k-means model
    • tts checkpoint model
    • code dict
    • vocoder (waveglow)
  2. Got a sample voice file (LJSpeech for this example): 84-121123-0005.flac
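As a sanity check on the input clip, I'd first confirm it is 16 kHz mono before feature extraction. A minimal sketch, assuming torchaudio is available and that the HuBERT acoustic model expects 16 kHz mono audio (input_16k.wav is just an illustrative output name):

import torchaudio

# Load the sample clip and check its format; resample/downmix if needed.
wav, sr = torchaudio.load("84-121123-0005.flac")
print(wav.shape, sr)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
if wav.size(0) > 1:
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
torchaudio.save("input_16k.wav", wav, 16000)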

  3. In resynthesize_speech.py, I added the following code to plot the mel-spectrogram:

import matplotlib.pyplot as plt
import librosa

def plot_spectrogram(specgram, title=None, ylabel="freq_bin"):
    # Plot the (mel-)spectrogram in dB and save it to disk for inspection.
    fig, axs = plt.subplots(1, 1)
    axs.set_title(title or "Spectrogram (db)")
    axs.set_ylabel(ylabel)
    axs.set_xlabel("frame")
    im = axs.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto")
    fig.colorbar(im, ax=axs)
    plt.savefig("figure01.jpg")

while True:
    # ~~~ (existing resynthesis loop in resynthesize_speech.py)
    plot_spectrogram(mel[0].cpu().float().numpy(), title="MelSpectrogram - torchaudio", ylabel="mel freq")
    # ~~~
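Since the plot comes out empty, it also helps to check the same tensor numerically to see whether the values are NaN or just very small. A minimal sketch (assumes torch is already imported in resynthesize_speech.py), placed right after the plot_spectrogram call inside the loop:

    # Quick numeric check of the mel tensor produced by Tacotron2:
    m = mel[0].float().cpu()
    print("mel shape:", tuple(m.shape),
          "any NaN:", torch.isnan(m).any().item(),
          "min/max:", m.min().item(), m.max().item())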

  4. Ran the following from bash:
export FAIRSEQ_ROOT=/home/ubuntu/fairseq
export DATA=/home/my/path/models

PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
    --feature_type 'hubert' \
    --layer 6 \
    --acoustic_model_path $DATA/hubert_base_ls960.pt \
    --kmeans_model_path $DATA/km.bin \
    --tts_model_path $DATA/tts_checkpoint_best.pt \
    --code_dict_path $DATA/code_dict.txt \
    --waveglow_path $DATA/waveglow_256channels_new.pt \
    --max_decoder_steps 2000

  5. Checked the mel-spectrogram: the plot is empty (nothing is shown).

[Attached image: mel-spectrogram plot]

What have you tried?

I also tried other models (CPC and wav2vec so far) and different k-means models. I also (and originally) tried producing a wav file with gslm/unit2speech/synthesize_audio_from_units.py. It shows the same result: no sound in the output (silent).
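For reference, here is a quick way to confirm the written file really is digital silence (or NaN) rather than just very quiet. A minimal sketch, assuming soundfile is installed; out.wav is a placeholder for whatever path resynthesize_speech.py wrote:

import numpy as np
import soundfile as sf

# "out.wav" is a placeholder for the resynthesized output path.
audio, sr = sf.read("out.wav")
print(sr, audio.shape,
      "peak:", float(np.max(np.abs(audio))),
      "any NaN:", bool(np.isnan(audio).any()))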

What's your environment?

(main environment)

  • fairseq Version (e.g., 1.0 or main): main (0.12.2)
  • PyTorch Version (e.g., 1.0): 1.12.1
  • OS (e.g., Linux): Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-125-generic x86_64)
  • How you installed fairseq (pip, source): git clone https://github.com/pytorch/fairseq
  • Build command you used (if compiling from source): pip install --editable ./
  • Python version: 3.9.13
  • CUDA/cuDNN version: Build cuda_11.3.r11.3/compiler.29745058_0
  • GPU models and configuration: GeForce RTX 3090 (NVIDIA Corporation Device 2204 (rev a1))
  • Any other relevant information: -

nonmetal avatar Oct 05 '22 08:10 nonmetal

I've run into the same problem. Have you solved it?

cywang97 avatar Oct 17 '22 07:10 cywang97

Hi, I found that the WaveGlow model generates NaN tensors, which leads to the silent output. I fixed the issue by using fp32: try removing .half() in the load_waveglow and load_tacotron functions. Hope this helps.
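Roughly, the change amounts to something like the sketch below. This is a simplified illustration, not the verbatim code in examples/textless_nlp/gslm/unit2speech/utils.py, and the same .half() removal applies to load_tacotron:

import torch

def load_waveglow(waveglow_path):
    # Simplified sketch of the loader; the real function has more setup.
    waveglow = torch.load(waveglow_path)["model"]
    # was (roughly): waveglow = waveglow.half().cuda().eval()  # fp16 -> NaN output
    waveglow = waveglow.cuda().eval()  # keep the model in fp32
    return waveglow

fp32 inference is a bit slower and uses more GPU memory, but the vocoder output is no longer NaN.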

cywang97 avatar Oct 17 '22 12:10 cywang97

That fix works perfectly! Thanks a lot for solving my problem 👍👍

nonmetal avatar Oct 20 '22 05:10 nonmetal
