Amphion
Why are Facodec and Ns3_facodec different?
I am looking at the model code in two folders, facodec and ns3_facodec. As I understand it, ns3_facodec is the training code for Facodec. However, I see several differences between the two architectures:
- First, the official Facodec has no LSTMs in either the Encoder or the Decoder.
- Second, the timbre encoders differ. Both use a Transformer, but the implementations are not the same.
- The generator loss is a weighted combination of several losses. Comparing against the appendix of the NaturalSpeech3 paper, the weights clearly do not match the paper; they follow the DAC paper instead.
- The upsample and downsample rates are not the same: the official Ns3_codec uses [2, 4, 5, 5], while the other uses [2, 4, 8, 8]. This also means the mel-spectrogram hop_lengths are 200 and 300, respectively.
- The training code uses audio sampled at 24 kHz, while the original paper works with 16 kHz audio.
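
To make the last two points concrete, here is a minimal sketch of the arithmetic. It assumes the hop length equals the product of the per-stage rates, which is a common convention for convolutional codecs but is not confirmed here for either code path. `hop_length` and `frame_rate` are illustrative helper names, not functions from Amphion.

```python
from math import prod


def hop_length(rates):
    """Samples per latent frame, assuming hop = product of the stage rates."""
    return prod(rates)


def frame_rate(sample_rate, hop):
    """Latent frames per second for a given sample rate and hop length."""
    return sample_rate / hop


# Rate lists quoted in the list above.
print(hop_length([2, 4, 5, 5]))
print(hop_length([2, 4, 8, 8]))

# Frame rates for the two (sample rate, hop length) pairs quoted above.
print(frame_rate(16000, 200))
print(frame_rate(24000, 300))
```

Under that assumption, [2, 4, 5, 5] multiplies to 200, but [2, 4, 8, 8] multiplies to 512 rather than 300, so the rates and the mel hop length may be configured independently in that code. Also, 16 kHz with hop 200 and 24 kHz with hop 300 both give 80 frames per second.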