Fine-tuning strategy for cross-lingual TTS (English/Japanese to Korean)
Hi, I'm working on building a Korean-focused cross-lingual TTS system using CosyVoice, and I have a few questions regarding the fine-tuning process.
Setup:
- Input languages: English, Japanese
- Target language: Korean (speech output)
- Dataset: Approximately 220 hours of multi-speaker Korean speech data
Questions:

- In a cross-lingual setting like English/Japanese to Korean speech, which components should be fine-tuned for the best performance?
  - Only the LLM (text-to-latent)
  - LLM + Flow
  - LLM + Flow + HiFi-GAN

  I'd like to understand which modules contribute most significantly to improving output quality in this scenario.

- Regarding the dataset:
  - Is 220 hours of multi-speaker Korean data sufficient for meaningful fine-tuning?
  - What is the recommended average duration per utterance? Should I aim for shorter samples (e.g., 3 seconds), or are longer samples preferable?

- Are there any recommended hyperparameters or training strategies?
  - Learning rate, batch size, warm-up steps
  - Whether to freeze specific modules during training
  - Any adjustments you'd recommend for Korean or multi-speaker settings
Thank you.
Most importantly, you need to train the LLM. I think 220 hours can significantly improve Korean language performance. Utterance duration should be like a normal sentence, for example 5-15 seconds.
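For illustration, here is a minimal PyTorch sketch of training only the LLM while keeping the other modules frozen. The attribute names `llm`, `flow`, and `hift` are assumptions for the sake of the example and may differ from the actual CosyVoice model class, so adapt them to your checkout.

```python
import torch


def freeze(module: torch.nn.Module) -> None:
    """Disable gradient updates for every parameter of a sub-module."""
    for p in module.parameters():
        p.requires_grad = False


def build_llm_only_optimizer(model: torch.nn.Module, lr: float = 1e-5):
    """Freeze every child except 'llm' and return an optimizer over the rest.

    'llm', 'flow', and 'hift' are assumed attribute names, not guaranteed
    to match the real CosyVoice model definition.
    """
    for name, child in model.named_children():
        if name != "llm":
            freeze(child)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


if __name__ == "__main__":
    # Toy stand-in modules; the real model would be loaded from a checkpoint.
    model = torch.nn.Module()
    model.llm = torch.nn.Linear(8, 8)
    model.flow = torch.nn.Linear(8, 8)
    model.hift = torch.nn.Linear(8, 8)
    optimizer = build_llm_only_optimizer(model)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {n_trainable}")
```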
@aluminumbox Thank you for your kind response.
I have a few more questions. When fine-tuning an LLM, how should the initial learning rate be determined? Is it okay to use the default values provided in the cosyvoice2.yaml file?
Also, I’ve noticed that the speaking style, breathing sounds, and tone can vary depending on the seed. Is there a recommended way to find the optimal seed?
Use a learning rate of 1e-5 when SFT-ing the LLM. For the optimal seed, you can use evaluation tools such as an ASR model or an emotion model to do post-evaluation.
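As a concrete example of that post-evaluation, here is a minimal sketch that ranks seeds by ASR character error rate. It assumes you have already synthesized one sample per seed to `outputs/seed_<n>.wav` from the same Korean text; Whisper and jiwer are just example tool choices, not part of CosyVoice.

```python
import whisper            # pip install openai-whisper
from jiwer import cer     # pip install jiwer

# Assumed layout: one synthesized file per seed, all from the same Korean text.
SEEDS = [0, 42, 1234, 7777]
REFERENCE_TEXT = "합성에 사용한 한국어 문장"  # the text that was synthesized

asr = whisper.load_model("small")  # any multilingual Whisper checkpoint works


def score_seed(seed: int) -> float:
    """Transcribe this seed's sample and return its CER against the input text."""
    result = asr.transcribe(f"outputs/seed_{seed}.wav", language="ko")
    return cer(REFERENCE_TEXT, result["text"])


scores = {seed: score_seed(seed) for seed in SEEDS}
best_seed = min(scores, key=scores.get)
print("per-seed CER:", scores)
print("best seed by CER:", best_seed)
```

The same loop can be extended with an emotion or speaker-similarity model if style consistency matters more than intelligibility.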