Using Uyghur data converted to Latin script to train the LLM
Hello, thanks for open-sourcing this project. If I train the LLM on Uyghur data converted to Latin script, can I directly use the CosyVoice-300M model you provided to extract discrete speech tokens in stage 2 of run.sh, or do I need to train a Latin ASR model following the encoder described in the paper and then extract the tokens? And for speaker embedding extraction in stage 1, should I use the model you provided, or do I need to train one myself?
Maybe @ZhihaoDU can answer this question?
I think you can first use the speech and text tokenizers without any changes, then evaluate whether the performance satisfies your task. If not, you will need to train a Latin ASR model that recognizes Uyghur and repeat the above process. For the speaker embedding, I think the provided model is sufficient; it was trained on hundreds of thousands of speakers.
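As a quick way to try this, here is a minimal sketch of what "use the tokenizers without any change" could look like outside run.sh, loosely following the repo's extraction tools. The ONNX file paths, input layout, and feature settings (128-bin Whisper log-mels for the speech tokenizer, 80-bin Kaldi fbanks for campplus) are assumptions based on a typical CosyVoice-300M download and may differ in your checkout:

```python
# Sketch: extract discrete speech tokens and a speaker embedding for one
# Uyghur utterance using the pretrained CosyVoice-300M artifacts, no retraining.
# Paths and ONNX input conventions below are assumptions; check your model dir.
import numpy as np
import onnxruntime
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import whisper

TOKENIZER_ONNX = "pretrained_models/CosyVoice-300M/speech_tokenizer_v1.onnx"  # assumed path
CAMPPLUS_ONNX = "pretrained_models/CosyVoice-300M/campplus.onnx"              # assumed path
WAV_PATH = "uyghur_sample.wav"                                                # your own audio

# Load and resample to 16 kHz, the rate both models expect.
audio, sr = torchaudio.load(WAV_PATH)
if sr != 16000:
    audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio)

# --- Stage 2 style: discrete speech tokens from the provided tokenizer ---
tok_session = onnxruntime.InferenceSession(TOKENIZER_ONNX, providers=["CPUExecutionProvider"])
# Whisper-style log-mel features (128 bins assumed), shape (1, 128, T).
feat = whisper.log_mel_spectrogram(audio, n_mels=128)
tok_inputs = {
    tok_session.get_inputs()[0].name: feat.detach().cpu().numpy(),
    tok_session.get_inputs()[1].name: np.array([feat.shape[2]], dtype=np.int32),
}
speech_tokens = tok_session.run(None, tok_inputs)[0].flatten().tolist()
print(f"{len(speech_tokens)} speech tokens, first 10: {speech_tokens[:10]}")

# --- Stage 1 style: speaker embedding from the provided campplus model ---
spk_session = onnxruntime.InferenceSession(CAMPPLUS_ONNX, providers=["CPUExecutionProvider"])
fbank = kaldi.fbank(audio, num_mel_bins=80, dither=0, sample_frequency=16000)
fbank = fbank - fbank.mean(dim=0, keepdim=True)  # per-utterance mean normalization
embedding = spk_session.run(
    None, {spk_session.get_inputs()[0].name: fbank.unsqueeze(0).numpy()}
)[0].flatten()
print(f"speaker embedding dim: {embedding.shape[0]}")
```

One rough sanity check: if the token sequences for Uyghur speech look degenerate (for example, collapsing onto a handful of token IDs across diverse utterances), that would suggest the tokenizer's ASR encoder does not generalize to the language and a Latin ASR retrain is worth attempting.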