
How to add other language to CosyVoice2

Open ukemamaster opened this issue 1 year ago • 13 comments

Hi, I would like to add Spanish (and a few more) language(s) to the CosyVoice2 model. My data is in LJSpeech style. How many models do I have to train, and how long will it take on 8 A30 GPUs? Is the training code available?

ukemamaster avatar Jan 03 '25 12:01 ukemamaster

cosyvoice2 training recipe is not ready yet

aluminumbox avatar Jan 04 '25 15:01 aluminumbox

> cosyvoice2 training recipe is not ready yet

When is it expected to be released?

ukemamaster avatar Jan 04 '25 16:01 ukemamaster

Any updates on this?

Ryu1845 avatar Jan 21 '25 13:01 Ryu1845

Any updates on this? Is the training recipe available?

ukemamaster avatar Mar 05 '25 08:03 ukemamaster

@ukemamaster they shared the training code in a different branch. However, I got better results from Cosy1. I couldn't find the reason, but Cosy2 overfits immediately.

EmreOzkose avatar Mar 05 '25 13:03 EmreOzkose

@EmreOzkose Great. Can you share your fine-tuning experience and code (if possible)? Which new language did you train for? Did you get the expected results? How about latency? Is it real time?

ukemamaster avatar Mar 06 '25 12:03 ukemamaster

As I said, the authors shared the code here. You can check the libritts example. You just need to prepare the data.
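Since the libritts recipe consumes Kaldi-style data files, an LJSpeech-style corpus (a `metadata.csv` plus a `wavs/` folder) can be converted with a short script. This is a minimal sketch under my own assumptions about the layout; the function name and speaker-naming scheme are illustrative, not part of the official recipe:

```python
import pathlib

def ljspeech_to_kaldi(ljs_dir, out_dir, spk="spk1"):
    """Convert an LJSpeech-style corpus (metadata.csv + wavs/) into
    Kaldi-style files (wav.scp, text, utt2spk).

    metadata.csv lines look like: LJ001-0001|raw text|normalized text
    We take the last field as the transcript.
    """
    ljs, out = pathlib.Path(ljs_dir), pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(ljs / "metadata.csv", encoding="utf-8") as meta, \
         open(out / "wav.scp", "w", encoding="utf-8") as wav_scp, \
         open(out / "text", "w", encoding="utf-8") as text_f, \
         open(out / "utt2spk", "w", encoding="utf-8") as utt2spk:
        for line in meta:
            parts = line.rstrip("\n").split("|")
            utt_id, transcript = parts[0], parts[-1]
            # Prefix utterance ids with the speaker so ids stay unique
            # when several single-speaker corpora are merged.
            utt = f"{spk}_{utt_id}"
            wav_scp.write(f"{utt} {ljs / 'wavs' / (utt_id + '.wav')}\n")
            text_f.write(f"{utt} {transcript}\n")
            utt2spk.write(f"{utt} {spk}\n")
```

Running this once per speaker directory and concatenating the resulting files gives a multispeaker data set in the shape the recipe's data stage expects.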

I trained 4 English + 1 Spanish speakers (together, or English only). Training finishes within one day using at most 24 GB of GPU memory. RTF is ~0.4-0.5 on GPU. When I tried to fine-tune cosy_v1-300M and cosy_v2-500M, both overfitted very early, but cosy_v1-300M overfits a bit later, so the 300M model learns my speakers better. Streaming is supported, but I don't have much experience with it; in any case we can say it is real time on GPU. The best part is that training is short (at most a day) and you get a good-enough multispeaker TTS. For multilingual training, cosy1-300M was also better in my experiments. Acoustic similarity is really good, but there are phonetic errors in the generations. In cosy2 the authors removed the text encoder, which might be the cause of the phonetic errors.

EmreOzkose avatar Mar 07 '25 08:03 EmreOzkose

@EmreOzkose Thanks for your detailed answer. How is the performance in Spanish? Can you please share your model weights and inference code? I actually tried the pre-trained CosyVoice2-0.5B model, and the RTF is always above 1. I am not sure if it depends on the GPU; I have an NVIDIA A30 with 24 GB of memory.

ukemamaster avatar Mar 07 '25 09:03 ukemamaster

The inference code is:

```python
import time

import soundfile as sf
import torchaudio

from cosyvoice.cli.cosyvoice import CosyVoice2

model_folder = "exps/exp1/cosyvoice2/llm/torch_ddp/epoch_0_whole.pt"
cosyvoice = CosyVoice2(model_folder, load_jit=False, load_trt=False, fp16=False)

print(cosyvoice.list_available_spks())

start = time.time()

spkr_name, text = "john", "hello"
result = next(cosyvoice.inference_sft(text, spkr_name, stream=False))
save_path = "sentences_0.wav"
torchaudio.save(save_path, result['tts_speech'], cosyvoice.sample_rate)

end = time.time()
rtf = (end - start) / sf.info(save_path).duration
print(rtf)
```
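The snippet above measures RTF for a single utterance, which is noisy. For a more stable number it helps to divide total wall-clock time by total audio produced over several sentences. A minimal, model-independent helper (the name and signature are my own, not part of CosyVoice):

```python
def corpus_rtf(timings):
    """Aggregate real-time factor over many utterances.

    timings: iterable of (synthesis_seconds, audio_seconds) pairs.
    Returns total wall-clock time divided by total audio duration,
    so long utterances are weighted proportionally. RTF < 1 means
    synthesis is faster than real time.
    """
    synth = sum(t for t, _ in timings)
    audio = sum(a for _, a in timings)
    if audio <= 0:
        raise ValueError("no audio produced")
    return synth / audio
```

Note that timing around `torchaudio.save` also counts the file write, so for pure model RTF you may want to stop the clock before saving.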

I am not a native Spanish speaker, hence I am not able to detect phoneme errors accurately, but it sounds good. My GPU is an A5000; I think yours is better. My RTF measurements were done with Cosy_v1-300M.

EmreOzkose avatar Mar 07 '25 10:03 EmreOzkose

@EmreOzkose Thanks for providing details. Did you train with neutral data or emotionally labeled data? I wonder, if trained with neutral data for a new language, will the model retain its original capabilities, like following instructions and Chinese TTS, or will it forget them? How was it in your case?

ukemamaster avatar Mar 08 '25 05:03 ukemamaster

@ukemamaster I am planning to investigate. I will write up my observations.

EmreOzkose avatar Mar 08 '25 15:03 EmreOzkose

@EmreOzkose, @ukemamaster Hi, any news on whether instructions worked after fine-tuning? Or how good can the results be on non-officially supported languages? I'd guess a few-shot fine-tuned Spanish generation would be very bad, right?

Until CosyVoice3 gets open-sourced, which seems to have significantly improved multilingual support, this is the main limitation of the model. Here's a cool ongoing project I found that tries to enhance the language support of CosyVoice2, along with some strategies they use: https://horstmann.tech/cosyvoice2-demo/#

Also, could you provide any insights on how you prepared the data for fine-tuning?

AlbertoAltozano avatar Nov 28 '25 12:11 AlbertoAltozano

Hi @AlbertoAltozano,

I couldn't train cosy2 with instructions, but I replaced the language tokens with speaker names, like

`<en> I want to leave here` -> `<john> I want to leave here`

This way the model learns the speaker style. How good is it? So-so. It might work better if you train in instruction style (certainly worth trying), like

`john <|endofprompt|> <en> I want to leave here` (I am not sure whether the language token should stay there)
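The token substitution above is a one-line transform over the transcripts. A sketch of what I mean (the helper name and the two-letter language-token pattern are my assumptions; adjust the regex if your tokens look different):

```python
import re

# Matches a leading two-letter language token such as <en> or <es>.
LANG_TOKEN = re.compile(r"^<[a-z]{2}>")

def speakerize(transcript: str, speaker: str) -> str:
    """Replace the leading language token with a speaker token,
    e.g. '<en> I want to leave here' -> '<john> I want to leave here'.
    Transcripts without a leading token are returned unchanged."""
    return LANG_TOKEN.sub(f"<{speaker}>", transcript, count=1)
```

Applying this to every line of the `text` file before training is all the trick requires; no model code changes are needed.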

EmreOzkose avatar Dec 01 '25 06:12 EmreOzkose