
Fine-tuning strategy for cross-lingual

Open pokabookinflab opened this issue 5 months ago • 4 comments

Hi, I'm working on building a Korean-focused cross-lingual TTS system using CosyVoice, and I have a few questions regarding the fine-tuning process.

Setup:

  • Input languages: English, Japanese
  • Target language: Korean (speech output)
  • Dataset: Approximately 220 hours of multi-speaker Korean speech data

Questions:

  1. In a cross-lingual setting like English/Japanese to Korean speech, which components should be fine-tuned for the best performance?

    • Only the LLM (text-to-latent)
    • LLM + Flow
    • LLM + Flow + HiFi-GAN

    I’d like to understand which modules contribute most significantly to improving output quality in this scenario.
  2. Regarding the dataset:

    • Is 220 hours of multi-speaker Korean data sufficient for meaningful fine-tuning?
    • What is the recommended average duration per utterance? Should I aim for shorter samples (e.g., 3 seconds), or are longer samples preferable?
  3. Are there any recommended hyperparameters or training strategies?

    • Learning rate, batch size, warm-up steps
    • Whether to freeze specific modules during training
    • Any adjustments you'd recommend for Korean or multi-speaker settings

Thank you.

pokabookinflab avatar Jul 22 '25 02:07 pokabookinflab

Most importantly, you need to train the LLM. I think 220 hours can significantly improve Korean-language performance. Utterance duration should be like a normal sentence, for example 5-15 seconds.
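
To make the module split concrete, here is a minimal sketch of training only the LLM while keeping the other components frozen. It assumes the model object exposes `llm`, `flow`, and `hift` submodules, as in the CosyVoice repo's model classes; adapt the attribute names to your checkout.

```python
import torch

def freeze_all_but_llm(model: torch.nn.Module):
    """Freeze every parameter, then re-enable gradients for the LLM only."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.llm.parameters():  # assumed attribute name, as in the CosyVoice repo
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch (model construction omitted):
# trainable = freeze_all_but_llm(model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # lr suggested later in this thread
```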

aluminumbox avatar Jul 23 '25 06:07 aluminumbox
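
For reference, a minimal sketch of the suggested 5-15 second duration filter; the manifest file name and its one-wav-path-per-line format are assumptions, not part of the CosyVoice data pipeline.

```python
import torchaudio

def within_duration(wav_path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """Return True if the utterance length falls in the recommended range."""
    info = torchaudio.info(wav_path)
    duration = info.num_frames / info.sample_rate
    return min_s <= duration <= max_s

# Hypothetical manifest: one absolute wav path per line.
with open("train_wavs.txt") as f:
    kept = [p.strip() for p in f if within_duration(p.strip())]
print(f"kept {len(kept)} utterances in the 5-15 s range")
```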

@aluminumbox Thank you for your kind response.

I have a few more questions. When fine-tuning an LLM, how should the initial learning rate be determined? Is it okay to use the default values provided in the cosyvoice2.yaml file?

Also, I’ve noticed that the speaking style, breathing sounds, and tone can vary depending on the seed. Is there a recommended way to find the optimal seed?

pokabookinflab avatar Jul 24 '25 07:07 pokabookinflab


Use a 1e-5 learning rate when SFT-ing the LLM. For finding an optimal seed, you can use evaluation tools such as an ASR model or an emotion model to do post-evaluation.
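
A minimal sketch of that post-evaluation loop: synthesize the same text with several seeds, transcribe each output with an ASR model, and keep the seed with the lowest character error rate. The Whisper model choice is an assumption, the `inference_cross_lingual` call follows the CosyVoice repo README, and the emotion-model check mentioned above is omitted for brevity.

```python
import torch
import torchaudio
import whisper          # pip install openai-whisper
from jiwer import cer   # pip install jiwer

asr = whisper.load_model("large-v3")

def score_seed(cosyvoice, tts_text, prompt_speech_16k, seed):
    """Synthesize with a fixed seed, transcribe, and return CER vs. the input text."""
    torch.manual_seed(seed)
    out = next(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k))
    wav = f"seed_{seed}.wav"
    torchaudio.save(wav, out["tts_speech"], cosyvoice.sample_rate)
    hyp = asr.transcribe(wav, language="ko")["text"]  # Korean output per this setup
    return cer(tts_text, hyp)

# Usage sketch (cosyvoice, text, and prompt construction omitted):
# best_seed = min(range(16), key=lambda s: score_seed(cosyvoice, text, prompt, s))
```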

aluminumbox avatar Jul 28 '25 03:07 aluminumbox

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Aug 28 '25 02:08 github-actions[bot]