Reproducibility problems with Librispeech model
I want to thank all the authors for the great work that they have done with this paper.
I am trying to reproduce the Librispeech model training to get a better sense of how the model is training in the hopes of building a 25Hz version of xcodec in the future.
I downloaded the full 960h Librispeech training set from here and kept the model config as it is. I only changed the batch size from 8 on 8 GPUs to 16 on 4 GPUs.
The problem I am running into is that training is not stable. It seems to me that the GAN setup is difficult to train and is the main culprit.
I just wanted to ask whether you experienced this during your experiments and how you dealt with it. I am almost tempted to just resume training from an earlier checkpoint. It would be really helpful if you could guide me here.
Thank you and I appreciate the time you've taken to read this!
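One common way to tame an unstable GAN codec run (not something the paper confirms the authors did, just a standard trick) is to warm up on reconstruction loss alone and only ramp in the adversarial term once reconstruction has settled. A minimal sketch, with hypothetical step counts:

```python
# Hypothetical adversarial-loss schedule: 0 during warmup, then a linear
# ramp, so the discriminator cannot destabilise early training.
def adv_weight(step, warmup_steps=10_000, ramp_steps=5_000, target=1.0):
    """Return the adversarial-loss weight for a given training step."""
    if step < warmup_steps:
        return 0.0
    ramp = min(1.0, (step - warmup_steps) / ramp_steps)
    return target * ramp

# In the training loop, the total generator loss would then be:
#   total_loss = recon_loss + adv_weight(step) * adv_loss
```

The warmup and ramp lengths here are placeholders; they would need tuning against the actual xcodec config.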
Hi, bro. Can I add your WeChat and talk about some questions with you!
Hi, I have experimented a lot with low-bitrate codecs recently. For a 25Hz codec, maybe you can try a Vocos (iSTFT) decoder [1], since the model then does not need to learn temporal upsampling. In addition, I will release a low-bitrate xcodec next month.
[1] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
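To make the iSTFT suggestion concrete: with a Vocos-style head the decoder emits one spectral frame per latent frame, and the iSTFT hop length does all the temporal upsampling. A back-of-the-envelope helper (the 16 kHz sample rate and the window-is-4x-hop choice are assumptions, not values from the paper):

```python
# Hypothetical shape calculator for an iSTFT (Vocos-style) decoder head.
def istft_head_shapes(sample_rate=16_000, frame_rate=25, n_fft=None):
    hop = sample_rate // frame_rate      # samples regenerated per latent frame
    n_fft = n_fft or 4 * hop             # common choice: window ~4x hop length
    bins = n_fft // 2 + 1                # one-sided spectrum
    # the head predicts magnitude + phase per bin -> 2 * bins output channels
    return {"hop_length": hop, "n_fft": n_fft, "freq_bins": bins,
            "head_channels": 2 * bins}
```

So at 25Hz and 16 kHz each latent frame covers 640 samples, and the decoder only needs a linear projection to `2 * bins` channels instead of a stack of transposed convolutions.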
@zhenye234 thank you for responding. I did notice that audio reconstruction was very good when using only 1 RVQ layer of the 8 quantizers available. I was wondering what the cause of that might be and whether it is an intended result.
I noticed you mention doing some kind of "dropout" of the quantizer layers (i.e., randomly selecting the number of RVQ layers from [1, 2, 3, 4, 8]). However, it doesn't seem to me that this alone explains being able to reconstruct audio with 1 RVQ layer.
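For reference, the structured dropout described above is usually implemented like this (a sketch, not the repo's actual code): each step only the first n of the 8 residual quantizers are used, with n drawn from the fixed set, so shallow-codebook reconstruction is trained explicitly.

```python
import random

# Hypothetical structured quantizer dropout: draw the number of active
# RVQ layers per training step from a fixed set of options.
def sample_active_layers(options=(1, 2, 3, 4, 8), rng=random):
    return rng.choice(options)

def rvq_forward(residual, quantizers, n_active):
    """Sum the outputs of the first `n_active` quantizers, residual-style."""
    out = 0.0
    for q in quantizers[:n_active]:
        code = q(residual)        # nearest-codebook approximation of residual
        out += code               # accumulate the reconstruction
        residual = residual - code  # next layer quantizes what is left
    return out
```

Because the n=1 case is sampled during training, the first codebook is optimized to carry a usable reconstruction on its own, which would explain the behaviour you observed.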
Have you solved the issue? I also ran into the same problem using the original settings.