What do the loss curves look like during your successful training?
Hello,
I've attempted to train FAcodec using my own dataset, but whether I start from scratch or fine-tune your provided checkpoint, the reconstructed audio clips are just noise. I fine-tuned the model on around 128 hours of Common Voice 18 zh-TW data. After approximately 20k steps the loss seemed to converge: some losses, like the feature loss, decreased steadily, while others, such as the mel loss and waveform loss, kept oscillating.
Do all losses decrease during your training process?
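For reference, this is roughly how I'm reading my own curves: I smooth them with an exponential moving average before deciding whether a loss has converged. A minimal sketch, assuming the losses were exported from TensorBoard as CSVs with the default Step/Value columns (the file names here are placeholders):

```python
# Smooth noisy loss curves with an exponential moving average before reading
# trends off them; raw adversarial losses oscillate even when training is fine.
import pandas as pd
import matplotlib.pyplot as plt

def plot_smoothed(csv_path: str, label: str, alpha: float = 0.02) -> None:
    df = pd.read_csv(csv_path)  # TensorBoard CSV export: Wall time, Step, Value
    smoothed = df["Value"].ewm(alpha=alpha).mean()  # smaller alpha = smoother
    plt.plot(df["Step"], df["Value"], alpha=0.2)    # raw curve, faint
    plt.plot(df["Step"], smoothed, label=label)     # smoothed trend

for name in ["mel_loss", "waveform_loss", "feature_loss"]:  # placeholder file names
    plot_smoothed(f"{name}.csv", name)
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```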
Could you please share your audio examples and loss curves? I believe they would help with analyzing the issue you encountered.
The loss curves look like this:
The audio samples are as follows: https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0/viewer/zh-TW
The reconstructed audio sample: https://drive.google.com/file/d/1yk_xZL17FkhIYMjojesd-PHWyAKuqzSA/view?usp=sharing
According to the mel_loss in the loss curve you shared, the model seems to have converged well. However, the reconstructed audio sample sounds as if it was generated by a randomly initialized model. May I know whether the reconstructed sample was retrieved from TensorBoard or produced by a separate reconstruction script?
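If it helps, here is a rough sketch of the kind of offline reconstruction script I mean, to rule out TensorBoard's audio logging as the culprit. Note that `load_facodec_checkpoint` and the `encode`/`decode` method names are placeholders, not this repo's actual API; substitute the real inference entry points:

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_facodec_checkpoint("checkpoint.pth").to(device).eval()  # placeholder loader

wav, sr = torchaudio.load("input.wav")
# Resample to the rate the model was trained at (24 kHz in this repo)
wav = torchaudio.functional.resample(wav, sr, 24000)

with torch.no_grad():
    codes = model.encode(wav.to(device))  # placeholder method name
    recon = model.decode(codes)           # placeholder method name

# Assuming recon comes back as (batch, channels, time)
torchaudio.save("recon.wav", recon.squeeze(0).cpu(), 24000)
```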
Hi there! Thank you for the training code. The original paper only provides the model and checkpoint, not the training code, so this is very helpful. However, I am facing some training problems.
I am trying to train this model from scratch. However, instead of using the provided code as-is, I have changed it to be closer to the original paper. Here are some of the modifications I have made:
- Loss function weights, so they match the values given in the paper's appendix.
- Audio sample rate. This code seems to use a sample rate of 24 kHz, so I changed it to 16 kHz as in the paper.
- Hop length and down-sample rates. Since the sample rate is changed, I also changed these two to 200 and [2, 4, 5, 5], respectively (see the sanity check after this list). In addition, my n_mels is 128 instead of 80.
- Pitch extractor. The code uses a pretrained JDC model to predict F0 labels. Instead, I followed the original approach that was used to train this PitchExtractor model.
- Phoneme extractor. This code uses a Wav2Vec model to produce labels for the phoneme quantizer. I am using the Montreal Forced Aligner instead, which means my phonemes are in CMU (ARPAbet) format rather than IPA as with Wav2Vec.
- Max training frames. The max frame count used for training in this code is 80. That seemed a bit small, so I increased it to 512 frames.
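For reference, this is the consistency check I ran on those settings (the variable names are mine, not the repo's config keys):

```python
# A quick sanity check for the 16 kHz settings above: the product of the
# decoder's down-sample (upsample) rates must equal the STFT hop length,
# otherwise the mel targets and the generated waveform are misaligned
# frame-to-frame, which alone can make training diverge.
import math

sample_rate = 16000
hop_length = 200
downsample_rates = [2, 4, 5, 5]

assert math.prod(downsample_rates) == hop_length, (
    f"rate product {math.prod(downsample_rates)} != hop length {hop_length}"
)

frames_per_second = sample_rate / hop_length  # 80 frames/s here
clip_seconds = 512 / frames_per_second        # max_frames=512 -> 6.4 s clips
print(f"{frames_per_second:.0f} frames/s, {clip_seconds:.1f} s per training clip")
```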
When I train the model with these modifications, it doesn't converge: the output is nonsense, all phonemes predicted by the Predictor are identical, and the codebook loss is huge. Can anybody help me fix this problem?
I understand your thinking, but I strongly recommend starting from the existing code, which has been proven to work, and then making your desired changes step by step; otherwise it is impossible to find the cause.