How to add another language to CosyVoice2
Hi, I would like to add Spanish (and a few more languages) to the CosyVoice2 model. My data is in LJSpeech style. How many models do I have to train, and how long will training take on 8 A30 GPUs? Is the training code available?
cosyvoice2 training recipe is not ready yet
When is it expected to be released?
Any updates on this?
Any updates on this? Is the training recipe available?
@ukemamaster they shared the training code in a different branch. However, I got better results from Cosy1. I couldn't find the reason, but Cosy2 overfits immediately.
@EmreOzkose Great. Can you share your fine-tuning experience and code (if possible)? Which new language have you trained for? Did you get the expected results? How about latency? Is it real time?
As I said, the authors shared the code here. You can check the libritts example. You just need to prepare the data.
I trained 4 English + 1 Spanish speakers (together, or English only). Training finishes in at most one day and uses at most 24 GB of GPU memory. RTF is ~0.4-0.5 on GPU. When I tried to fine-tune cosy_v1-300M or cosy_v2-500M, both overfitted very early, but cosy_v1-300M overfits a bit later, so the 300M model learns my speakers better. Streaming is supported, but I don't have much experience with it. Hence, we can say it is real time on GPU. The best thing is that training is very short (at most one day) and you get a good enough multispeaker TTS. In my experience, cosy1-300M is also better if you train multilingually. Acoustic similarity is really good, but there are phonetic errors in the generations. In Cosy2, the authors removed the text encoder, which might be the cause of the phonetic errors.
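On "you just need to prepare the data": the libritts recipe consumes Kaldi-style data files (wav.scp, text, utt2spk, spk2utt; check local/prepare_data.py in that example for the exact expected format). A minimal stdlib sketch for converting LJSpeech-style data, assuming metadata.csv lines look like `utt_id|raw|normalized` and a single speaker; the function and speaker names here are illustrative:

```python
from pathlib import Path

def ljspeech_to_kaldi(data_dir: str, out_dir: str, speaker: str = "john") -> None:
    """Convert an LJSpeech-style metadata.csv (utt_id|raw|normalized)
    into Kaldi-style wav.scp / text / utt2spk / spk2utt files."""
    data, out = Path(data_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wav_scp, text, utt2spk, utt_ids = [], [], [], []
    for line in (data / "metadata.csv").read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split("|")
        utt_id, transcript = parts[0], parts[-1]  # last field = normalized text
        uid = f"{speaker}_{utt_id}"  # prefix with the speaker so ids group per speaker
        wav_scp.append(f"{uid} {data / 'wavs' / (utt_id + '.wav')}")
        text.append(f"{uid} {transcript}")
        utt2spk.append(f"{uid} {speaker}")
        utt_ids.append(uid)
    (out / "wav.scp").write_text("\n".join(wav_scp) + "\n", encoding="utf-8")
    (out / "text").write_text("\n".join(text) + "\n", encoding="utf-8")
    (out / "utt2spk").write_text("\n".join(utt2spk) + "\n", encoding="utf-8")
    (out / "spk2utt").write_text(f"{speaker} {' '.join(utt_ids)}\n", encoding="utf-8")
```

For multiple speakers you would call this once per speaker directory and concatenate the outputs.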
@EmreOzkose Thanks for your detailed answer. How is the performance in Spanish? Can you please share your model weights and inference code? Actually, I tried the pre-trained CosyVoice2-0.5B model, and the RTF is always above 1. I am not sure if it depends on the GPU. I have an NVIDIA A30 GPU with 24 GB memory.
Inference code is:

import time

import soundfile as sf
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2

model_folder = "exps/exp1/cosyvoice2/llm/torch_ddp/epoch_0_whole.pt"
cosyvoice = CosyVoice2(model_folder, load_jit=False, load_trt=False, fp16=False)
print(cosyvoice.list_available_spks())

start = time.time()
spkr_name, text = "john", "hello"
result = next(cosyvoice.inference_sft(text, spkr_name, stream=False))
save_path = 'sentences_0.wav'
torchaudio.save(save_path, result['tts_speech'], cosyvoice.sample_rate)
end = time.time()

# RTF = wall-clock synthesis time / duration of the generated audio
rtf = (end - start) / sf.info(save_path).duration
print(rtf)
I am not a native Spanish speaker, so I am not able to detect phoneme errors accurately, but it sounds good. My GPU is an A5000; I think yours is better. My RTF measurements were done for Cosy_v1-300M.
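For anyone comparing numbers in this thread: RTF here is wall-clock synthesis time divided by the duration of the generated audio, as in the snippet above; values below 1 mean faster than real time. A tiny stdlib helper (name is illustrative) that makes the definition explicit:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / generated audio duration.
    RTF < 1.0 means the model synthesizes faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# e.g. 2.0 s of wall-clock time for 5.0 s of audio -> RTF 0.4
print(real_time_factor(2.0, 5.0))
```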
@EmreOzkose thanks for providing details. Did you train with neutral data or emotionally labeled data? I wonder whether, if trained with neutral data for a new language, the model will retain its original capabilities, like taking instructions and doing TTS in Chinese, or whether it will forget them. How is it in your case?
@ukemamaster I am planning to investigate. I will write observations.
@EmreOzkose, @ukemamaster Hi, any news on whether the instructions worked after fine-tuning? Or how good can the results be on non-officially supported languages? I'd guess a few-shot fine-tuned Spanish generation would be very bad, right?
Until CosyVoice3 is open-sourced, which seems to have significantly improved multilingual support, this is the main limitation of the model. Here's a cool ongoing project I found that tries to enhance the language support of CosyVoice2, along with some of the strategies they use: https://horstmann.tech/cosyvoice2-demo/#
Also, could you provide any insights into how you prepared the data for fine-tuning?
Hi @AlbertoAltozano,
I couldn't train Cosy2 with instructions, but I replaced language tokens with speaker names, like
'<en> I want to leave here' -> '<john> I want to leave here'. This way, the model learns the speaker's style. How good is it? So-so. Maybe if you train with instruction-style prompts it will be better (certainly worth trying), like
john <|endofprompt|> <en> I want to leave here (I am not sure if the lang token should be there)
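The token swap described above is just a string rewrite over the training transcripts. A minimal sketch (hypothetical helper name; the pattern assumes two-letter language tokens like <en> or <zh> at the start of each transcript):

```python
import re

# Matches a leading language token such as <en> or <zh>.
LANG_TOKEN = re.compile(r"^<[a-z]{2}>")

def speakerize(transcript: str, speaker: str) -> str:
    """Replace the leading language token with a speaker token,
    e.g. '<en> I want to leave here' -> '<john> I want to leave here'."""
    return LANG_TOKEN.sub(f"<{speaker}>", transcript, count=1)

print(speakerize("<en> I want to leave here", "john"))
```

Transcripts with no leading language token pass through unchanged.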