Using Uyghur data converted to Latin script to train the LLM
Hello, thanks for open-sourcing this project. If I train the LLM on Uyghur data converted to Latin script, can I directly use the CosyVoice-300M model you provided to extract discrete speech tokens in stage 2 of run.sh, or do I need to train a Latin ASR model following the encoder described in the paper and then extract the tokens? And for speaker embedding extraction in stage 1, should I use the model you provided, or do I need to train one myself?
Maybe @ZhihaoDU can answer this question?
I think you can first use the speech and text tokenizers without any changes, then evaluate whether the performance satisfies your task. If not, you will need to train a Latin ASR model that recognizes Uyghur and repeat the above process. For the speaker embedding, I think the provided model is sufficient; it was trained on hundreds of thousands of speakers.
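As a quick way to try this, here is a minimal sketch of what "use the tokenizers without any change" could look like outside run.sh, loosely following the repo's extraction tools. The ONNX file paths, input layout, and feature settings (128-bin Whisper log-mels for the speech tokenizer, 80-bin Kaldi fbanks for campplus) are assumptions based on a typical CosyVoice-300M download and may differ in your checkout:

```python
# Sketch: extract discrete speech tokens and a speaker embedding for one
# Uyghur utterance using the pretrained CosyVoice-300M artifacts, no retraining.
# Paths and ONNX input conventions below are assumptions; check your model dir.
import numpy as np
import onnxruntime
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import whisper

TOKENIZER_ONNX = "pretrained_models/CosyVoice-300M/speech_tokenizer_v1.onnx"  # assumed path
CAMPPLUS_ONNX = "pretrained_models/CosyVoice-300M/campplus.onnx"              # assumed path
WAV_PATH = "uyghur_sample.wav"                                                # your own audio

# Load and resample to 16 kHz, the rate both models expect.
audio, sr = torchaudio.load(WAV_PATH)
if sr != 16000:
    audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio)

# --- Stage 2 style: discrete speech tokens from the provided tokenizer ---
tok_session = onnxruntime.InferenceSession(TOKENIZER_ONNX, providers=["CPUExecutionProvider"])
# Whisper-style log-mel features (128 bins assumed), shape (1, 128, T).
feat = whisper.log_mel_spectrogram(audio, n_mels=128)
tok_inputs = {
    tok_session.get_inputs()[0].name: feat.detach().cpu().numpy(),
    tok_session.get_inputs()[1].name: np.array([feat.shape[2]], dtype=np.int32),
}
speech_tokens = tok_session.run(None, tok_inputs)[0].flatten().tolist()
print(f"{len(speech_tokens)} speech tokens, first 10: {speech_tokens[:10]}")

# --- Stage 1 style: speaker embedding from the provided campplus model ---
spk_session = onnxruntime.InferenceSession(CAMPPLUS_ONNX, providers=["CPUExecutionProvider"])
fbank = kaldi.fbank(audio, num_mel_bins=80, dither=0, sample_frequency=16000)
fbank = fbank - fbank.mean(dim=0, keepdim=True)  # per-utterance mean normalization
embedding = spk_session.run(
    None, {spk_session.get_inputs()[0].name: fbank.unsqueeze(0).numpy()}
)[0].flatten()
print(f"speaker embedding dim: {embedding.shape[0]}")
```

One rough sanity check: if the token sequences for Uyghur speech look degenerate (for example, collapsing onto a handful of token IDs across diverse utterances), that would suggest the tokenizer's ASR encoder does not generalize to the language and a Latin ASR retrain is worth attempting.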