Voice cloning in fine-tuning
Does the fine-tuning script also optimise the model for zero-shot voice cloning, or do we need another script for that?
Can you also share the pretraining code and configs please?
Hi, since our model architecture is autoregressive, zero-shot voice cloning is an inherent capability of the base model, so you don't need a separate script for it. The fine-tuning script we've shared is actually an "extra" toolset for those who want near-perfect fidelity for a specific individual voice.
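To illustrate why no separate script is needed, here is a minimal sketch of prompt-conditioned autoregressive decoding. It is not this project's code: the token layout in `build_prompt`, the function names, and `toy_next_token` (a stand-in for a real model) are all assumptions made for illustration. The idea is only that the reference audio tokens are placed in the context, and the model continues from them.

```python
from typing import Callable, List


def build_prompt(ref_text_ids: List[int],
                 target_text_ids: List[int],
                 ref_audio_ids: List[int]) -> List[int]:
    # Hypothetical layout: text tokens first, then the reference speaker's
    # audio tokens. A real model defines its own prompt format.
    return ref_text_ids + target_text_ids + ref_audio_ids


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    # Plain autoregressive loop: each new token is predicted from the
    # prompt plus everything generated so far.
    context = list(prompt)
    new_tokens: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(context)
        if tok == eos_id:
            break
        context.append(tok)
        new_tokens.append(tok)
    return new_tokens


def toy_next_token(context: List[int]) -> int:
    # Stand-in for a trained model: just continues from the last token.
    return (context[-1] + 1) % 50


prompt = build_prompt([1, 2], [3, 4], [10, 11])
audio_tokens = generate(toy_next_token, prompt, max_new_tokens=5, eos_id=49)
```

In a real system the generated audio tokens would then be decoded to a waveform; cloning happens because the continuation is conditioned on the reference audio in the prompt, with no weight updates.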
Interesting. I was trying to add instruction following to the model, and it started to bleed the instruction into the output audio itself. Very weird.