fine-tuning MMS TTS models
Hi,
How can I fine-tune the MMS TTS models? I used the default VITS code; however, I had issues when resuming from the existing optimizer state dict: "in adamw: exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (38) must match the size of tensor b (178) at non-singleton dimension 0"
Please help. Thanks.
@taalua This is probably due to the mismatch in vocabulary between the original VITS code and ours. The vocabulary VITS uses is hard-coded here and is used to build the symbol-to-id mapping, while we use a different vocabulary per language, specified in vocab.txt. You can use this to get the text mapping for MMS TTS models and use its id_to_symbol instead.
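For concreteness, a minimal sketch of building the mappings from a per-language vocab.txt, assuming one symbol per line with the line index as the id (the file path is illustrative):

```python
# Build symbol <-> id mappings from an MMS-style vocab.txt
# (assumption: one symbol per line, line index = id).
def load_mms_vocab(path="vocab.txt"):
    with open(path, encoding="utf-8") as f:
        # strip only the newline so the space symbol survives
        symbols = [line.rstrip("\n") for line in f]
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    id_to_symbol = {i: s for i, s in enumerate(symbols)}
    return symbols, symbol_to_id, id_to_symbol

symbols, symbol_to_id, id_to_symbol = load_mms_vocab()
# Use symbol_to_id in place of VITS's hard-coded mapping when encoding text.
```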
BTW, we're working on making this very easy in transformers (a usage sketch follows the links below). You can check:
- https://huggingface.co/docs/transformers/main/en/model_doc/mms
- https://github.com/huggingface/transformers/pull/23813
- https://huggingface.co/facebook/mms-1b-all
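Once that lands, inference should be about as simple as the sketch below; the VitsModel API and model id follow the transformers MMS docs linked above, but treat the details as an assumption until the PR is merged:

```python
# Sketch: MMS TTS inference through transformers (per the MMS docs above).
import torch
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples), 16 kHz audio
```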
Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has Language ID and ASR but no TTS.
Most of the VITS code remains unchanged. You only need to define the vocabulary of the new language (i.e., a list of characters used in the new language) and use that as the symbols here.
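A minimal sketch of deriving such a character list from the new language's training transcripts (the file name and helper are illustrative, not part of the fairseq code):

```python
# Derive the character vocabulary ("symbols") for a new TTS language
# from its training transcripts.
from collections import Counter

def build_symbols(transcript_path="train_nepali.txt"):
    counter = Counter()
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.rstrip("\n"))
    return sorted(counter)  # deterministic id order; keeps ' ' if present

symbols = build_symbols()
# Substitute this list for the hard-coded English `symbols` in the VITS code.
```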
@chevalierNoir The Eng model works with a random discriminator checkpoint; however, I hit this error when fine-tuning the Kor model:
packages/bitsandbytes/optim/optimizer.py", line 455, in update_step
if state["state1"].dtype == torch.float:
KeyError: 'state1'
I cannot figure out why the two models don't behave the same way. The main difference, from my perspective, is whether the checkpoint has pre-trained optimizer states or not.
@CopyNinja1999 Did you download the full model checkpoint (including generator, discriminator, and optimizer states) for fine-tuning, as suggested here? eng and kor should be of the same format.
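One quick way to check is to inspect the checkpoint's top-level keys; a sketch assuming the usual VITS checkpoint layout (the file name is illustrative):

```python
# Verify the downloaded checkpoint carries optimizer state, not just weights.
import torch

ckpt = torch.load("G_100000.pth", map_location="cpu")
print(list(ckpt.keys()))  # a full checkpoint should include 'model' and 'optimizer'
```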
@chevalierNoir Thanks for your reply! I found out that this error was caused by the bnb optimizer wrapper from this repo: https://github.com/nivibilla/efficient-vits-finetuning. About the full model checkpoint: yes, I tested it yesterday using this romanizer, https://github.com/osori/korean-romanizer, and this dataset, https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset. The synthesized audio became pure noise after fine-tuning (fine-tuning works with the eng model, however). Do you have any hint why?
BTW, what is the romanizer you use for all the languages?
Update: resampling the audio from 44 kHz to 22.05 kHz fixed it.
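For reference, a sketch of that resampling step with torchaudio (the paths are illustrative):

```python
# Resample source audio to the 22.05 kHz rate the VITS/MMS TTS setup expects.
import torchaudio

wav, sr = torchaudio.load("kss/1_0000.wav")           # 44.1 kHz source
wav = torchaudio.functional.resample(wav, sr, 22050)  # match training rate
torchaudio.save("kss_22k/1_0000.wav", wav, 22050)
```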
In case it's needed, our romanizer is this. Note that we only do uromanization for ~5 languages with large character vocabularies; otherwise, using raw characters achieves slightly better performance.
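If you need to uromanize your own text, here is a sketch of calling the uroman CLI (https://github.com/isi-nlp/uroman) from Python; the script path is illustrative:

```python
# Romanize UTF-8 text via uroman's command-line interface.
import subprocess

def uromanize(text, uroman_pl="uroman/bin/uroman.pl"):
    # uroman reads text on stdin and writes the romanization to stdout.
    result = subprocess.run(
        ["perl", uroman_pl],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(uromanize("안녕하세요"))  # e.g. "annyeonghaseyo"
```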
We now have a super simple fine-tuning script in Transformers: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#connectionist-temporal-classification-with-adapters
how about TTS?
Working on it cc @sanchit-gandhi
Adding the models first to the library in https://github.com/huggingface/transformers/pull/24085, then will add training functionality in a second step 🤗
Can't wait! 😀
Hi @chevalierNoir @patrickvonplaten
For English, I see '1', '5', '6' in the vocabulary list. What does each of them mean? Also, what's the difference between '–' and '_'?
FYI, English vocabulary: ['k', "'", 'z', 'y', 'u', 'd', 'h', 'e', 's', 'w', '–', '3', 'c', 'p', '-', '1', 'j', 'm', 'i', ' ', 'f', 'l', 'o', '0', 'b', 'r', 'a', '4', '2', 'n', '_', 'x', 'v', 't', 'q', '5', '6', 'g']
Thanks
Any update on this?
Please post an update about adding training functionality, @sanchit-gandhi.
Any update on this? @sanchit-gandhi
Just stopping by for an update. Could anyone please help me with a TTS fine-tuning codebase?
I've released this repository to allow VITS/MMS fine-tuning with Transformers compatibility: https://github.com/ylacombe/finetune-hf-vits. Feel free to check it out :hugs: