Voice cloning in fine-tuning

Open tensorjackal opened this issue 1 month ago • 1 comments

Does the fine-tuning script also optimise the model for zero-shot voice cloning, or do we need a separate script for that?

Can you also share the pretraining code and configs please?

Jan 15 '26 21:01 tensorjackal

Hi, since our model architecture is autoregressive, zero-shot voice cloning is an inherent capability of the base model, so you don't need a separate script for it. The fine-tuning scripts we've shared are an "extra" toolset for those who want near-perfect fidelity for one specific voice.
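To illustrate why no extra script is needed, here is a minimal sketch of prompt-based cloning with an autoregressive model. It assumes the model works on discrete audio tokens and that the prompt is laid out as text followed by reference audio; that layout, the function names, and the toy `toy_next_token` stand-in are all hypothetical and are not taken from this repository's code.

```python
from typing import Callable, List, Sequence


def clone_voice_zero_shot(
    next_token: Callable[[Sequence[int]], int],
    ref_text: Sequence[int],
    ref_audio: Sequence[int],
    target_text: Sequence[int],
    eos: int,
    max_new_tokens: int = 100,
) -> List[int]:
    """Continue a reference-audio prefix token by token (illustrative only)."""
    # Assumed prompt layout: transcript of the reference clip, the new text
    # to speak, then the reference audio tokens. The model simply continues
    # the audio stream, so the speaker identity comes from the prefix.
    context = list(ref_text) + list(target_text) + list(ref_audio)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(context)
        if tok == eos:
            break
        generated.append(tok)
        context.append(tok)  # feed the prediction back in (autoregression)
    return generated


def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained model: counts up, then stops."""
    tok = context[-1] + 1
    return 0 if tok > 105 else tok  # 0 plays the role of EOS here


tokens = clone_voice_zero_shot(
    toy_next_token,
    ref_text=[1, 2],
    ref_audio=[101, 102],
    target_text=[3, 4],
    eos=0,
)
print(tokens)
```

The point is only that the reference clip enters as context at inference time, with no weight updates; fine-tuning changes the weights instead.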

Jan 16 '26 08:01 liuxin99

Interesting. I was trying to add instruction following to the model, and it started leaking the instruction into the output audio itself. Very weird.

Jan 26 '26 15:01 tensorjackal