fine-tuning MMS TTS models
Hi,
How can I fine-tune the MMS TTS models? I used the default VITS code; however, I had issues when resuming from the existing optimizer state dict: "in adamw: exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (38) must match the size of tensor b (178) at non-singleton dimension 0"
Please help. Thanks.
@taalua This is probably due to the mismatch in vocabulary between the original VITS code and ours. The vocabulary VITS uses is hard-coded here and is used to build the symbol-to-id mapping, while we use a different vocabulary per language, specified in vocab.txt. You can use this to get the text mapping for MMS TTS models and use its id_to_symbol instead.
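For concreteness, a minimal sketch of building the mappings from a per-language vocab.txt, assuming one symbol per line with the line index as the id (the file path is illustrative):

```python
# Build symbol <-> id mappings from an MMS-style vocab.txt
# (assumption: one symbol per line, line index = id).
def load_mms_vocab(path="vocab.txt"):
    with open(path, encoding="utf-8") as f:
        # strip only the newline so the space symbol survives
        symbols = [line.rstrip("\n") for line in f]
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    id_to_symbol = {i: s for i, s in enumerate(symbols)}
    return symbols, symbol_to_id, id_to_symbol

symbols, symbol_to_id, id_to_symbol = load_mms_vocab()
# Use symbol_to_id in place of VITS's hard-coded mapping when encoding text.
```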
BTW, we're working on making this very easy in transformers (a usage sketch follows the links below). You can check:
- https://huggingface.co/docs/transformers/main/en/model_doc/mms
- https://github.com/huggingface/transformers/pull/23813
- https://huggingface.co/facebook/mms-1b-all
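Once that lands, inference should be about as simple as the sketch below; the VitsModel API and model id follow the transformers MMS docs linked above, but treat the details as an assumption until the PR is merged:

```python
# Sketch: MMS TTS inference through transformers (per the MMS docs above).
import torch
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples), 16 kHz audio
```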
Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has Language ID and ASR but no TTS.
Most of the VITS code remains unchanged. You only need to define the vocabulary of the new language (i.e., a list of characters used in the new language) and use that as the symbols here.
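A minimal sketch of deriving such a character list from the new language's training transcripts (the file name and helper are illustrative, not part of the fairseq code):

```python
# Derive the character vocabulary ("symbols") for a new TTS language
# from its training transcripts.
from collections import Counter

def build_symbols(transcript_path="train_nepali.txt"):
    counter = Counter()
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.rstrip("\n"))
    return sorted(counter)  # deterministic id order; keeps ' ' if present

symbols = build_symbols()
# Substitute this list for the hard-coded English `symbols` in the VITS code.
```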
@chevalierNoir The Eng model works with a random discriminator checkpoint; however, I hit this error when fine-tuning the Kor model:
packages/bitsandbytes/optim/optimizer.py", line 455, in update_step
if state["state1"].dtype == torch.float:
KeyError: 'state1'
I cannot figure out why the two models don't behave the same way. The main difference, from my perspective, is whether the checkpoint has pre-trained optimizer states or not.
@CopyNinja1999 Did you download the full model checkpoint (including generator, discriminator, and optimizer states) for fine-tuning, as suggested here? eng and kor should be of the same format.
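One quick way to check is to inspect the checkpoint's top-level keys; a sketch assuming the usual VITS checkpoint layout (the file name is illustrative):

```python
# Verify the downloaded checkpoint carries optimizer state, not just weights.
import torch

ckpt = torch.load("G_100000.pth", map_location="cpu")
print(list(ckpt.keys()))  # a full checkpoint should include 'model' and 'optimizer'
```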
@chevalierNoir Thanks for your reply! I found out that this error was caused by the bnb optimizer wrapper from this repo: https://github.com/nivibilla/efficient-vits-finetuning. About the full model checkpoint: yes, I tested it yesterday using this romanizer, https://github.com/osori/korean-romanizer, and this dataset, https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset. The synthesized audio became pure noise after fine-tuning (fine-tuning works with the eng model, however). Do you have any hint why?
BTW, what is the romanizer you use for all the languages?
Update: resampling the audio from 44 kHz to 22.05 kHz fixed it.
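For reference, a sketch of that resampling step with torchaudio (the paths are illustrative):

```python
# Resample source audio to the 22.05 kHz rate the VITS/MMS TTS setup expects.
import torchaudio

wav, sr = torchaudio.load("kss/1_0000.wav")           # 44.1 kHz source
wav = torchaudio.functional.resample(wav, sr, 22050)  # match training rate
torchaudio.save("kss_22k/1_0000.wav", wav, 22050)
```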
In case it's needed, our romanizer is this. Note that we only do uromanization for ~5 languages with large character vocabularies; otherwise, using raw characters achieves slightly better performance.
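If you need to uromanize your own text, here is a sketch of calling the uroman CLI (https://github.com/isi-nlp/uroman) from Python; the script path is illustrative:

```python
# Romanize UTF-8 text via uroman's command-line interface.
import subprocess

def uromanize(text, uroman_pl="uroman/bin/uroman.pl"):
    # uroman reads text on stdin and writes the romanization to stdout.
    result = subprocess.run(
        ["perl", uroman_pl],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(uromanize("안녕하세요"))  # e.g. "annyeonghaseyo"
```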
We now have a super simple fine-tuning script in Transformers: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#connectionist-temporal-classification-with-adapters
how about TTS?
Working on it cc @sanchit-gandhi
Adding the models first to the library in https://github.com/huggingface/transformers/pull/24085, then will add training functionality in a second step 🤗
Can't wait! 😀
Hi @chevalierNoir @patrickvonplaten
For English, I see '1', '5', '6' in the vocabulary list. What does each of them mean? Also, what's the difference between '–' and '_'?
FYI, English vocabulary: ['k', "'", 'z', 'y', 'u', 'd', 'h', 'e', 's', 'w', '–', '3', 'c', 'p', '-', '1', 'j', 'm', 'i', ' ', 'f', 'l', 'o', '0', 'b', 'r', 'a', '4', '2', 'n', '_', 'x', 'v', 't', 'q', '5', '6', 'g']
Thanks
Any update on this?
Please post an update about adding training functionality, @sanchit-gandhi.
Any update on this? @sanchit-gandhi
Just stopping by for an update. Could anyone please help me with a TTS fine-tuning codebase?
I've released this repository to allow VITS/MMS fine-tuning with Transformers compatibility: https://github.com/ylacombe/finetune-hf-vits. Feel free to check it out :hugs: