Why are Facodec and Ns3_facodec different?

Open ndhuynh02 opened this issue 1 year ago • 0 comments

I am looking at the model code in the two folders facodec and ns3_facodec. As far as I understand, ns3_facodec is the training code for Facodec. However, I see several differences between the two architectures:

First of all, there are no LSTMs in the official Facodec, in either the Encoder or the Decoder.
Secondly, the timbre encoder is somewhat different. Both use a Transformer, but the implementations are not the same.
The generator loss is a weighted combination of multiple losses. However, comparing with the Appendix of the NaturalSpeech3 paper, the weights clearly do not match the paper; they look closer to those in the DAC paper.
The upsample and downsample rates are not the same. For the official Ns3_codec they are [2, 4, 5, 5], while for the other one they are [2, 4, 8, 8]. This also means the hop_lengths for the mel-spectrogram are 200 and 300, respectively.
In the training code, the audio has a sampling rate of 24 kHz, while the original paper works on 16 kHz audio.
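
To make the last two points concrete, here is a small sketch of the arithmetic. It assumes that the total hop of a DAC-style encoder is simply the product of its per-layer rates; the helper names `hop_length` and `frame_rate` are made up for illustration and are not taken from either folder. Under that assumption, [2, 4, 5, 5] gives a hop of 200, but [2, 4, 8, 8] gives 512 rather than 300, so the 300 quoted above may come from a separately configured mel-spectrogram hop or from a different set of rates. That would be worth checking against the config.

```python
from math import prod


def hop_length(rates):
    """Total stride of the encoder, assuming it is the product of the per-layer rates."""
    return prod(rates)


def frame_rate(sample_rate, hop):
    """Number of latent (or mel) frames per second of audio."""
    return sample_rate / hop


# Rates quoted in this issue (hypothetical variable names).
rates_first = [2, 4, 5, 5]
rates_second = [2, 4, 8, 8]

hop_first = hop_length(rates_first)
hop_second = hop_length(rates_second)

print(hop_first, frame_rate(16_000, hop_first))    # 200 80.0
print(hop_second, frame_rate(24_000, hop_second))  # 512 46.875

# A hop of 300 at 24 kHz gives the same frame rate as a hop of 200 at 16 kHz.
print(frame_rate(24_000, 300))                     # 80.0
```

Note that a hop of 200 at 16 kHz and a hop of 300 at 24 kHz both yield 80 frames per second, which would be consistent with the training code keeping the frame rate of the paper while moving to 24 kHz audio.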

Nov 10 '24 02:11 ndhuynh02