CosyVoice

Adding a new language, but the result is meaningless audio.

Open drlor2k opened this issue 1 year ago • 17 comments

Hello @aluminumbox, I continued training the llm model on a German dataset (300 hours), but after 25k steps the model cannot pronounce German, and it has also lost the 5 originally supported languages.

My process:

  • I followed the stages in the run.sh file and created trainable parquet files.
  • In the examples/libritts/cosyvoice/conf/cosyvoice.yaml file I only changed the get_tokenizer, as below:
get_tokenizer: !name:whisper.tokenizer.get_tokenizer
    multilingual: True
    num_languages: 100
    language: 'de'
    task: 'transcribe'

What makes the model unusable? Here are some possibilities I can think of:

  1. This approach is wrong: it is not possible to continue training with a new language.
  2. This approach works, but my process is flawed.
  3. The flow model (and possibly other components) also needs to be trained, ... or a new language only works when training from scratch.
  4. The model cannot be used at only 25k steps; it needs more training.
  5. ...

Thanks for taking the time to reply.

drlor2k avatar Sep 01 '24 12:09 drlor2k

Do not change the get_tokenizer: !name:whisper.tokenizer.get_tokenizer param.

aluminumbox avatar Sep 03 '24 07:09 aluminumbox
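
As a side note on this advice, the multilingual Whisper BPE vocabulary already covers German text, so changing the language argument mainly changes the start-of-transcript special tokens rather than how the transcript itself is tokenized. A minimal check, assuming a recent openai-whisper release is installed:

from whisper.tokenizer import get_tokenizer

# Same tokenizer settings as in cosyvoice.yaml, once with 'en' and once with 'de'.
tok_en = get_tokenizer(multilingual=True, num_languages=100, language="en", task="transcribe")
tok_de = get_tokenizer(multilingual=True, num_languages=100, language="de", task="transcribe")

text = "Hallo zusammen, ich komme aus München."
assert tok_en.encode(text) == tok_de.encode(text)  # identical BPE ids either way
print(tok_en.sot_sequence)  # differs only in the <|en|> language token
print(tok_de.sot_sequence)  # differs only in the <|de|> language token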

Thanks @aluminumbox for the reply, one more question: my German dataset is mixed with a bit of English, is this OK, or should I use a pure German dataset?

drlor2k avatar Sep 03 '24 09:09 drlor2k

Try using something like this:

text = f"<|{lang}|>" + data["sentence"].strip()

MiXaiLL76 avatar Sep 04 '24 14:09 MiXaiLL76

Hi @MiXaiLL76, do you mean that when adding a new language my data needs to have the language prefix in the text?

So the content in the txt files should look like:

  • Pure
<|de|>Hallo zusammen, ich komme aus München.
  • Mixed
<|de|>Hallo zusammen, ich bin <|en|>John Biden.

drlor2k avatar Sep 04 '24 19:09 drlor2k

Yes, that's right. The only thing is that it takes a long time to learn: 24 hours of training did not give me very good results, but the model already speaks my language and English well (I also mixed the data).

MiXaiLL76 avatar Sep 05 '24 07:09 MiXaiLL76

In principle, I can share the code with which I trained my models as an example in this repository, if there is support for that.

@aluminumbox

MiXaiLL76 avatar Sep 05 '24 11:09 MiXaiLL76

Thank you, but the most important part is actually how to obtain the language_id. Currently our code does not support detecting the language automatically, and our data preprocessing does not pack a language id, so only adding a language id in the data loader is not enough.

aluminumbox avatar Sep 06 '24 02:09 aluminumbox
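
One pragmatic workaround, purely as an illustration and not something in the CosyVoice pipeline, is to classify each transcript offline during data preparation and prepend the corresponding tag there. A minimal sketch assuming the langid package (the helper name is hypothetical); for mixed-language corpora this is only a rough per-utterance heuristic:

import langid

def add_language_tag(sentence: str) -> str:
    # Classify the whole transcript and prepend a <|xx|> tag to it.
    lang, _score = langid.classify(sentence)
    return f"<|{lang}|>" + sentence.strip()

print(add_language_tag("Hallo zusammen, ich komme aus München."))
# -> <|de|>Hallo zusammen, ich komme aus München.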

Do I need to train speech_tokenizer_v1 for this?

MiXaiLL76 avatar Sep 06 '24 07:09 MiXaiLL76

In principle, I managed to train the model well in my language (Russian) by adding the language identifier f"<|{lang}|>" to the text.

But there are some small errors:

  1. The accent the model speaks with is closer to English.
  2. Words are sometimes stuttered and repeated.

Now I am studying how to train the flow model, because I assume that it is also important for a new language.

MiXaiLL76 avatar Sep 06 '24 07:09 MiXaiLL76

You can try training with more Russian data with <|lang|> added to the text. I believe the pronunciation problem will decrease, but also remember that the English pronunciation problem will increase. I think you should try with more Russian data first, then decide whether or not you should train the speech tokenizer.

aluminumbox avatar Sep 06 '24 07:09 aluminumbox

Currently, more than 60 thousand Russian audio clips and 60 thousand English audio clips are included in the training.

MiXaiLL76 avatar Sep 06 '24 07:09 MiXaiLL76

Our base model has no Russian training data. I believe you need at least 1k hours of Russian data for training.

aluminumbox avatar Sep 06 '24 07:09 aluminumbox
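
As a rough back-of-the-envelope check (the average clip length below is an assumption; actual durations depend on the dataset), 60 thousand clips amounts to roughly 100-170 hours, well below the ~1k hours suggested above:

n_clips = 60_000
for avg_sec in (6, 10):  # assumed average clip length in seconds
    print(f"{avg_sec} s/clip -> {n_clips * avg_sec / 3600:.0f} hours")
# 6 s/clip -> 100 hours
# 10 s/clip -> 167 hours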

[image] There are quite a few hours of data.

Perhaps I didn't set up the training very well.

train_conf:
    optim: adam
    optim_conf:
        lr: 1e-4
    scheduler: constantlr
MiXaiLL76 avatar Sep 06 '24 08:09 MiXaiLL76
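
For comparison, a warmup-style schedule of the kind used in WeNet-style train configs might look like the sketch below. The values are placeholders rather than the repo defaults, and a constant 1e-4 learning rate can be aggressive when fine-tuning a pretrained checkpoint:

train_conf:
    optim: adam
    optim_conf:
        lr: 1e-5            # placeholder: a smaller LR is common when fine-tuning
    scheduler: warmuplr     # warm up, then decay, instead of a constant LR
    scheduler_conf:
        warmup_steps: 2500  # placeholder
    grad_clip: 5
    accum_grad: 2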

...then decide whether or not you should train the speech tokenizer

@aluminumbox Which speech tokenizer did you train? I assume it was from here: https://github.com/modelscope/FunCodec. But judging by the onnx model, I can't find a similar one.

Could you share the link? I would study it and maybe prepare training scripts for the public.

MiXaiLL76 avatar Sep 08 '24 15:09 MiXaiLL76
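
One way to compare the released tokenizer against candidate codecs such as FunCodec is to inspect the ONNX graph's input/output signature. A minimal sketch assuming onnxruntime is installed and the file is the released speech_tokenizer_v1.onnx checkpoint:

import onnxruntime as ort

# Print the input/output names, shapes, and dtypes of the speech tokenizer graph.
sess = ort.InferenceSession("speech_tokenizer_v1.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)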

@aluminumbox After a lot of work, I realized that for another language, Russian in my case, it is simply necessary to train a new speech_tokenizer. The speaker embedding model basically does not need to be retrained and can be left as is (I assume you used this: https://github.com/lovemefan/campplus).

Please show an example of training the speech_tokenizer, and I will be ready to personally put together an example of training a model on a Russian dataset.

MiXaiLL76 avatar Sep 12 '24 19:09 MiXaiLL76

Hi @MiXaiLL76, did you find a way to train new languages correctly? Thanks.

justinatbahasa avatar Oct 08 '24 06:10 justinatbahasa

No, but I'm sure that to achieve decent TTS you need more than llm training alone.

MiXaiLL76 avatar Oct 09 '24 10:10 MiXaiLL76

For German, just use XTTS. It works very well. Samples I made are here: https://soundcloud.com/cylonius

Cyl0nius avatar May 30 '25 12:05 Cyl0nius