Adding a new language, but the result is meaningless audio
Hello @aluminumbox, I continued training the LLM model on a German dataset (300 hours), but after 25k steps the model cannot pronounce German or any of the 5 previously supported languages.
My process:
- I followed the stages in the `run.sh` file and created trainable parquet files.
- In the `examples/libritts/cosyvoice/conf/cosyvoice.yaml` file I only changed `get_tokenizer`, as below:

```yaml
get_tokenizer: !name:whisper.tokenizer.get_tokenizer
    multilingual: True
    num_languages: 100
    language: 'de'
    task: 'transcribe'
```
What makes the model unusable? Here are my guesses:
- This approach is simply wrong: it is not possible to continue training with a new language.
- The approach works, but my process is flawed.
- The flow model (and possibly more, ...) also needs training, or a new language only works when training from scratch.
- The model cannot be used at 25k steps; it needs more training.
- ...
Thanks for taking the time to reply.
Do not change the `get_tokenizer: !name:whisper.tokenizer.get_tokenizer` param.
Thanks @aluminumbox for the reply, one more question! My German dataset is mixed with some English. Is that OK, or should I use a pure German dataset?
Try using something like this:

```python
text = f"<|{lang}|>" + data["sentence"].strip()
```
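If it helps, here is a minimal, self-contained sketch of that prefixing step. The `sentence` field name and the dict-shaped row are assumptions (Common Voice-style metadata); only the `<|lang|>` token format comes from the snippet above.

```python
def tag_sentence(data: dict, lang: str) -> str:
    """Prepend a Whisper-style language token to one transcript.

    Assumptions: `data` is a single dataset row with a "sentence"
    field; this mirrors the one-liner suggested in the thread and is
    not CosyVoice's official API.
    """
    return f"<|{lang}|>" + data["sentence"].strip()


row = {"sentence": "  Hallo zusammen, ich komme aus München. "}
print(tag_sentence(row, "de"))
# -> <|de|>Hallo zusammen, ich komme aus München.
```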
Hi @MiXaiLL76, do you mean that when adding a new language my data needs to have the language prefix in the text?
So the content in the txt files should be:
- Pure:
  <|de|>Hallo zusammen, ich komme aus München.
- Mixed:
  <|de|>Hallo zusammen, ich bin <|en|>John Biden.
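A small sketch of how such tagged lines could be built from (language, segment) pairs. The helper name and the joining behavior are my assumptions; only the `<|lang|>` token format comes from the thread.

```python
def build_tagged_text(segments):
    """Join (lang, text) segments into one language-tagged line.

    Example: [("de", "Hallo zusammen, ich bin "), ("en", "John Biden.")]
    becomes "<|de|>Hallo zusammen, ich bin <|en|>John Biden."
    Segments are concatenated as-is, so trailing spaces inside a
    segment are preserved.
    """
    return "".join(f"<|{lang}|>{text}" for lang, text in segments)


# Pure German line:
print(build_tagged_text([("de", "Hallo zusammen, ich komme aus München.")]))
# Mixed German/English line:
print(build_tagged_text([("de", "Hallo zusammen, ich bin "), ("en", "John Biden.")]))
```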
Yes, that's right. The only caveat is that it takes a long time to learn: 24 hours of training did not give me very good results. But the model already speaks my language and English well (I also mixed the data).
In principle, I can share the code I used to train my models as an example in this repository, if there is interest.
@aluminumbox
Thank you, but the most important part is actually how to obtain the language_id. Currently our code does not support detecting the language automatically, and our data preprocessing does not pack a language id, so adding the language id only in the data loader is not enough.
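To illustrate the point about packing a language id at preprocessing time rather than in the data loader, here is a rough sketch. The `utt2lang` mapping, function name, and default-language fallback are all hypothetical; this is not CosyVoice's actual pipeline.

```python
# Hypothetical sketch: attach a language id to each utterance while the
# preprocessed data is written, so the text is already tagged before it
# ever reaches the data loader.

utt2lang = {          # would normally come from dataset metadata
    "utt001": "de",
    "utt002": "en",
}


def pack_language_id(utt_id: str, text: str, default_lang: str = "de") -> str:
    """Look up the utterance's language and prepend the <|lang|> token."""
    lang = utt2lang.get(utt_id, default_lang)
    return f"<|{lang}|>{text.strip()}"


print(pack_language_id("utt001", "Hallo zusammen."))  # -> <|de|>Hallo zusammen.
print(pack_language_id("utt002", "Good morning."))    # -> <|en|>Good morning.
```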
Do I need to train speech_tokenizer_v1 for this?
In principle, I managed to train the model well in my language (Russian) by adding the language identifier `f"<|{lang}|>"` to the text.
But there are some small errors:
- The accent the model speaks with is closer to English.
- Words are sometimes stuttered and repeated.
Now I am studying how to train the flow model, because I assume it is also important for a new language.
You can try training with more Russian data with `<|lang|>` added to the text; I believe the pronunciation problem will decrease, but also remember that the English pronunciation problem will increase. I think you should try with more Russian data first, then decide whether or not you should train the speech tokenizer.
Currently, more than 60 thousand Russian audio clips and 60 thousand English audio clips are used in the training.
Our base model has no Russian training data. I believe you need at least 1k hours of Russian data for training.
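Rough arithmetic on why 60k clips may still fall well short of the suggested 1k hours. The 10-second average clip length is my assumption; actual clip lengths vary.

```python
# Back-of-the-envelope: total hours from clip count and an assumed
# average clip length.
clips = 60_000
avg_seconds = 10                     # assumed average clip length
hours = clips * avg_seconds / 3600   # 60k clips * 10 s = 600,000 s

print(f"{hours:.0f} hours")          # -> 167 hours, well below 1000
```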
That is quite a few hours of data.
Perhaps I didn't set up the training very well:

```yaml
train_conf:
    optim: adam
    optim_conf:
        lr: 1e-4
    scheduler: constantlr
```
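For comparison, WeNet-style recipes (which CosyVoice's training setup resembles) typically use a warmup scheduler rather than a constant learning rate. This is a sketch under that assumption, not a verified CosyVoice config; the `warmup_steps` value is illustrative only.

```yaml
train_conf:
    optim: adam
    optim_conf:
        lr: 1e-4              # peak LR reached after warmup
    scheduler: warmuplr       # ramps LR up from 0; often more stable than constantlr
    scheduler_conf:
        warmup_steps: 25000   # illustrative value, tune for your data size
```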
@aluminumbox Which speech tokenizer did you train? I assume it was from here: https://github.com/modelscope/FunCodec. But judging by the ONNX model, I can't find a similar one.
Could you share the link? I would study it and maybe prepare training scripts for the public.
@aluminumbox After a lot of work, I realized that for another language, in my case Russian, it is simply necessary to train a new speech_tokenizer. The embedding can basically be left as is, untrained (I assume you used this: https://github.com/lovemefan/campplus).
Please show an example of training the speech_tokenizer, and I will be ready to personally prepare an example of training a model on a Russian dataset.
Hi @MiXaiLL76, did you find a way to train new languages correctly? Thanks.
No, but I'm sure that to achieve normal TTS you need more than LLM training.
For German, just use XTTS. It works very well. Samples I made are here: https://soundcloud.com/cylonius