
Issues with Audio Quality for Longer Text Inputs Using VCTK Pretrained Model

Open hungdinhxuan opened this issue 7 months ago • 5 comments

Hello @keonlee9420,

I've been working with the VCTK pretrained model provided in this GitHub repository and have encountered issues with audio quality for longer text inputs. While the first few seconds of the generated audio (roughly the first 3 seconds) are high quality, the audio noticeably degrades or becomes unnatural after that point. This occurs regardless of whether I use the naive, aux, or shallow method.

In the paper "DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs", a dataset of 228 Mandarin Chinese speakers with over 200 hours of speech was used, whereas the GitHub implementation uses the VCTK dataset, which contains around 44 hours of speech. I'm curious whether this difference, both in the length of individual speech samples and in the total volume of data, might be influencing the quality of the generated audio, especially for longer text inputs.

I have a couple of specific questions:

  1. Are there differences in the average length of speech samples between the Mandarin dataset used in the paper and the VCTK dataset used in the implementation? (A quick way to measure this is sketched after this list.)
  2. Are there any inherent limitations in the model regarding the length of text input for maintaining high-quality audio synthesis?
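
For reference, here's a minimal sketch of how one could compare average utterance length between the two corpora. It assumes the `soundfile` package is installed; the corpus paths are illustrative.

```python
# Compare the average utterance duration of two corpora.
import soundfile as sf
from pathlib import Path

def mean_duration_sec(corpus_root):
    durations = [sf.info(str(p)).duration for p in Path(corpus_root).rglob("*.wav")]
    return sum(durations) / len(durations) if durations else 0.0

# e.g. compare:
# mean_duration_sec("VCTK-Corpus/wav48")          # VCTK clips are reportedly ~3 s on average
# mean_duration_sec("LibriTTS/train-clean-100")
```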

I suspect that the discrepancies between the datasets might be impacting the model's performance, particularly for longer inputs. Any insights or suggestions would be greatly appreciated.

Thank you for your work on this project and for any assistance you can provide.

hungdinhxuan avatar Nov 17 '23 05:11 hungdinhxuan

I've run into the same problem while trying to apply the code to the LibriTTS dataset. During both training and inference, the generated samples sound quite clear for the first 3 seconds but experience a notable quality degradation after that point (compare the samples in the attached samples.zip). Can anyone provide some insight into this problem, or is there a possible solution?

ZhengRachel avatar Nov 20 '23 04:11 ZhengRachel

Hi @ZhengRachel, I see you've hit the same problem while applying the code to the LibriTTS dataset. I think the audio clips in the VCTK corpus are short (about 3 seconds on average), and that may be the main reason for this unusual phenomenon, so I intend to train on the LibriTTS dataset and pick only longer audio (roughly as sketched below). In your training runs, did you tune any hyperparameters, or did you keep the same configuration as for VCTK when applying it to LibriTTS?
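
An illustrative sketch of the filtering step I have in mind: keep only utterances longer than a threshold and write them to a file list. It assumes `soundfile`; the path, threshold, and output format are placeholders, not the repo's actual train-list format.

```python
# Keep only long utterances and write their paths to a simple file list.
import soundfile as sf
from pathlib import Path

def write_long_utterance_list(corpus_root, out_path, min_sec=12.0):
    with open(out_path, "w") as f:
        for wav in sorted(Path(corpus_root).rglob("*.wav")):
            if sf.info(str(wav)).duration >= min_sec:  # duration in seconds
                f.write(f"{wav}\n")

write_long_utterance_list("LibriTTS/train-clean-100", "long_utterances.txt")
```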

hungdinhxuan avatar Nov 20 '23 13:11 hungdinhxuan

@hungdinhxuan I didn't make any changes to the model configuration, but I made some minor changes during speech preprocessing (though I don't think they would cause this phenomenon). I synthesized 16 kHz speech, and the input mel-spectrogram was extracted with sr=16000, hop_length=196, win_length=1024, and fft_size=2048. All other training, model, and preprocessing configurations were kept the same as the defaults in this repository. Also, I used almost all the utterances in train-clean-100 and train-clean-360 to train the model, regardless of their lengths, and the degradation was only noticeable for longer utterances, e.g. over 5 seconds. Would you mind sharing your results on LibriTTS after you finish your experiments?
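
For clarity, the extraction settings above correspond roughly to the following sketch using librosa; the repository's actual preprocessing (windowing, normalization, log scaling) may differ in detail, and n_mels=80 is my assumption rather than a stated value.

```python
# Rough mel-spectrogram extraction with the settings described above.
import librosa
import numpy as np

def extract_mel(wav_path, sr=16000, n_fft=2048, hop_length=196,
                win_length=1024, n_mels=80):
    wav, _ = librosa.load(wav_path, sr=sr)  # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=win_length, n_mels=n_mels,
    )
    return np.log(np.clip(mel, 1e-5, None))  # log-compress, clipped for stability
```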

ZhengRachel avatar Nov 20 '23 14:11 ZhengRachel

@ZhengRachel, I picked 2,000 random samples whose utterances average over 12 seconds for training; I also synthesized 16 kHz speech and kept all other configurations the same as the defaults. I trained on a single GPU for 300k steps, and the performance doesn't match the VCTK pretrained model. I want to repeat this experiment with my larger custom dataset, but I'm having trouble training the naive model on multiple GPUs. Would you mind sharing your multi-GPU training code?

hungdinhxuan avatar Nov 28 '23 14:11 hungdinhxuan

@hungdinhxuan It seems the code provided in this repository already supports multi-GPU training, so I was just using it as-is. You can use multiple GPUs simply by setting CUDA_VISIBLE_DEVICES to the GPU IDs you want, e.g. `CUDA_VISIBLE_DEVICES=0,1` (a sketch of the underlying pattern is below).
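
For reference, this is the usual PyTorch data-parallel pattern in repos like this one; a minimal sketch only, as the exact wiring in train.py may differ.

```python
# Standard PyTorch multi-GPU pattern. Launch with, e.g.:
#   CUDA_VISIBLE_DEVICES=0,1 python3 train.py ...
import torch
import torch.nn as nn

model = nn.Linear(80, 80)  # stand-in for the actual acoustic model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits each batch across all visible GPUs
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```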

ZhengRachel avatar Nov 29 '23 08:11 ZhengRachel