
Reproducibility problems with Librispeech model

Open Vanlogh opened this issue 1 year ago • 5 comments

I want to thank all the authors for the great work that they have done with this paper.

I am trying to reproduce the Librispeech model training to get a better sense of how the model trains, in the hope of building a 25 Hz version of xcodec in the future.

I downloaded the full 960h Librispeech training set from here and kept the model config as it is. The only change I made was the batch size: from 8 on 8 GPUs to 16 on 4 GPUs.
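For reference, a quick sanity check of that change, assuming the batch sizes quoted above are per GPU (the post does not say so explicitly). The helper name `global_batch_size` and the `grad_accum_steps` parameter are hypothetical and not taken from the xcodec config.

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_gpu_batch * num_gpus * grad_accum_steps


# Assumption: "8 on 8 GPUs" and "16 on 4 GPUs" are per-GPU batch sizes.
original = global_batch_size(per_gpu_batch=8, num_gpus=8)
modified = global_batch_size(per_gpu_batch=16, num_gpus=4)

print(original, modified)
```

Under that assumption the global batch size is the same in both setups, so the learning rate would not need rescaling for that reason alone. Per-GPU effects are a separate question that this arithmetic does not cover.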

The problem I am running into is that training is not stable. It seems to me that the GAN setup is difficult to train and is the main culprit.
[Two images attached in the original issue]

I just wanted to ask whether you experienced this during your experiments and, if so, how you dealt with it. I am almost tempted to just resume training from an earlier checkpoint. It would be really helpful if you could give me some guidance here.

Thank you and I appreciate the time you've taken to read this!

Vanlogh avatar Oct 23 '24 21:10 Vanlogh

Hi, bro. Can I add you on WeChat to discuss a few questions?

ooooolong avatar Oct 30 '24 09:10 ooooolong

Hi, I have experimented a lot with low-bitrate codecs recently. For a 25 Hz codec, you could try a Vocos (iSTFT) decoder [1], since the model then does not need to learn temporal upsampling. In addition, I will release a low-bitrate xcodec next month.

[1] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
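To make the "no learned temporal upsampling" point concrete, here is a minimal pure-Python sketch, not the Vocos or xcodec implementation. It assumes 16 kHz audio (the rate Librispeech is distributed at) and a simplified overlap-add that averages overlapping frames with a Hann window. The function names `hop_length`, `irfft`, and `istft` are illustrative only. The idea is that the decoder only has to predict one spectrum per latent frame, and the fixed inverse STFT turns each frame into `hop` new samples.

```python
import cmath
import math


def hop_length(sample_rate: int, frame_rate: int) -> int:
    """Samples per latent frame, e.g. 16000 Hz / 25 Hz = 640."""
    if sample_rate % frame_rate != 0:
        raise ValueError("frame rate must divide the sample rate")
    return sample_rate // frame_rate


def irfft(bins, n_fft):
    """Naive inverse real DFT from n_fft // 2 + 1 complex bins (n_fft even)."""
    half = n_fft // 2
    out = []
    for n in range(n_fft):
        # DC and Nyquist bins appear once; the other bins have a mirrored twin.
        acc = bins[0].real + bins[half].real * math.cos(math.pi * n)
        for k in range(1, half):
            acc += 2.0 * (bins[k] * cmath.exp(2j * math.pi * k * n / n_fft)).real
        out.append(acc / n_fft)
    return out


def istft(spec, n_fft, hop):
    """Simplified overlap-add: window-weighted average of overlapping frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    length = (len(spec) - 1) * hop + n_fft
    signal = [0.0] * length
    envelope = [0.0] * length
    for t, bins in enumerate(spec):
        frame = irfft(bins, n_fft)
        for n in range(n_fft):
            signal[t * hop + n] += window[n] * frame[n]
            envelope[t * hop + n] += window[n]
    # Guard positions where the window weight is (numerically) zero.
    return [s / e if e > 1e-8 else 0.0 for s, e in zip(signal, envelope)]


# Toy sizes so the naive DFT stays fast; a real 25 Hz setup at 16 kHz
# would use hop = hop_length(16000, 25).
n_fft, hop = 8, 2
spec = [[complex(n_fft, 0.0)] + [0j] * (n_fft // 2) for _ in range(5)]  # DC only
audio = istft(spec, n_fft, hop)
print(hop_length(16000, 25), len(audio))
```

In an actual model the `spec` frames would come from the decoder's output layer at the latent frame rate. The upsampling from frames to samples is done entirely by the fixed transform, which is the property the suggestion above relies on.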

zhenye234 avatar Nov 04 '24 14:11 zhenye234

@zhenye234 thank you for responding. I did notice that audio reconstruction was very good when using only 1 RVQ layer out of the 8 quantizers available. I was wondering what the cause of that might be and whether it is an intended result?

I noticed you mention doing some kind of "dropout" of the quantizer layers (i.e., randomly selecting the number of RVQ layers from the options [1, 2, 3, 4, 8]). However, it doesn't seem to me that this alone would allow audio reconstruction with 1 RVQ layer.
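To be explicit about what I mean by quantizer "dropout", here is a minimal pure-Python sketch of the idea as I understand it. It is not xcodec's actual code: the toy codebooks, the 2-dimensional vectors, and the names `nearest`, `rvq`, and `sample_num_quantizers` are all made up for illustration. At each training step a number of layers is drawn from the options, and only that many residual quantizers are applied.

```python
import random

N_Q_OPTIONS = (1, 2, 3, 4, 8)  # options mentioned above


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq(vec, codebooks, n_q):
    """Residual VQ using only the first n_q codebooks."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes = []
    for codebook in codebooks[:n_q]:
        idx = nearest(codebook, residual)
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, recon


def sample_num_quantizers(rng):
    """Quantizer dropout: draw how many RVQ layers to use for this step."""
    return rng.choice(N_Q_OPTIONS)


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy codebooks (hypothetical): layer i holds all 2-D combinations of
# {-s, 0, s} with s = 2 ** -i, so later layers refine finer detail.
codebooks = []
for i in range(8):
    s = 2.0 ** -i
    codebooks.append([[a, b] for a in (-s, 0.0, s) for b in (-s, 0.0, s)])

rng = random.Random(0)
x = [0.7, -0.3]
n_q = sample_num_quantizers(rng)
codes, recon = rvq(x, codebooks, n_q)
print(n_q, codes, sq_error(x, recon))
```

In this toy setup the reconstruction with 1 layer is visibly coarser than with 8, which is why the good 1-layer quality I observed surprised me.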

Vanlogh avatar Nov 20 '24 17:11 Vanlogh

Have you solved the issue? I ran into the same problem using the original settings.

morganshi avatar Mar 26 '25 19:03 morganshi

Hi, bro. Can I add you on WeChat to discuss a few questions?

Hi, have you solved the issue? I ran into the same problem using the original settings.

morganshi avatar Mar 26 '25 20:03 morganshi