
fine-tuning MMS TTS models

Open taalua opened this issue 1 year ago • 21 comments

Hi,

How do I fine-tune the MMS TTS models? I used the default VITS code; however, I had issues when resuming from the existing optimizer state dict: " in adamw exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (38) must match the size of tensor b (178) at non-singleton dimension 0 "

Please help. Thanks.

taalua avatar May 31 '23 17:05 taalua

@taalua This is probably due to a mismatch in vocabulary between the original VITS code and ours. The vocabulary VITS uses is hard-coded here and is used to build the symbol-to-id mapping, while we use a different vocabulary per language, specified in vocab.txt. You can use this to get the text mapping for MMS TTS models and use its id_to_symbol instead.
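For reference, here is a minimal sketch of swapping the hard-coded VITS symbols for the per-language vocab.txt. It assumes the file lists one symbol per line, with the line index serving as the symbol id; the `load_vocab` helper and the file path are illustrative, not part of the MMS code:

```python
# Sketch: build symbol <-> id mappings from an MMS per-language vocab.txt.
# Assumes one symbol per line, line index == symbol id (illustrative only).
def load_vocab(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        symbols = [line.rstrip("\n") for line in f]
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    id_to_symbol = {i: s for i, s in enumerate(symbols)}
    return symbols, symbol_to_id, id_to_symbol

symbols, symbol_to_id, id_to_symbol = load_vocab("vocab.txt")

# Map an already normalized (and, if applicable, romanized) string to token ids:
text = "hello world"
token_ids = [symbol_to_id[ch] for ch in text if ch in symbol_to_id]
```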

chevalierNoir avatar Jun 01 '23 03:06 chevalierNoir

BTW, we're working on making this very easy in transformers. You can check:

  • https://huggingface.co/docs/transformers/main/en/model_doc/mms
  • https://github.com/huggingface/transformers/pull/23813
  • https://huggingface.co/facebook/mms-1b-all

patrickvonplaten avatar Jun 02 '23 11:06 patrickvonplaten

Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has Language ID and ASR but no TTS.

ravsau avatar Jun 06 '23 00:06 ravsau

Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has Language ID and ASR but no TTS.

Most of the VITS code remains unchanged. You only need to define the vocabulary of the new language (i.e., a list of characters used in the new language) and use that as the symbols here.
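As a rough illustration of that step, here is a sketch for deriving the symbol list of a new language from its training transcripts; the transcripts file name and the lack of any extra normalization are assumptions, not part of the recipe:

```python
# Sketch: collect the character vocabulary of a new language from its transcripts.
# "transcripts.txt" (one normalized utterance per line) is an assumed input file.
chars = set()
with open("transcripts.txt", encoding="utf-8") as f:
    for line in f:
        chars.update(line.rstrip("\n"))

symbols = sorted(chars)  # this list would stand in for the hard-coded VITS symbols
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(symbols))
```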

chevalierNoir avatar Jun 06 '23 00:06 chevalierNoir

Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has Language ID and ASR but no TTS.

Most of the VITS code remains unchanged. You only need to define the vocabulary of the new language (i.e., a list of characters used in the new language) and use that as the symbols here.

@chevalierNoir The Eng model is working with a random discriminator checkpoint; however, I ran into this error when fine-tuning the Kor model:

packages/bitsandbytes/optim/optimizer.py", line 455, in update_step
    if state["state1"].dtype == torch.float:
KeyError: 'state1'

I cannot figure out why the two models don't behave the same way. The main difference, from my perspective, is whether the checkpoint has pre-trained optimizer states or not.

CopyNinja1999 avatar Jun 10 '23 06:06 CopyNinja1999

@CopyNinja1999 Did you download the full model checkpoint (including generator, discriminator, and optimizer states) for fine-tuning, as suggested here? The eng and kor checkpoints should be in the same format.

chevalierNoir avatar Jun 12 '23 03:06 chevalierNoir

@chevalierNoir Thanks for your reply! I found that the error was caused by the bnb optimizer wrapper from this repo: https://github.com/nivibilla/efficient-vits-finetuning. About the full model checkpoint: yes, I tested it yesterday using the romanizer https://github.com/osori/korean-romanizer and this dataset https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset. The synthesized audio became pure noise after fine-tuning (fine-tuning works for the eng model, however). Do you have any hint as to why?

CopyNinja1999 avatar Jun 13 '23 02:06 CopyNinja1999

btw, what is the romanizer you use for all the languages?

CopyNinja1999 avatar Jun 13 '23 02:06 CopyNinja1999

Update: resampling the audio from 44 kHz to 22.05 kHz fixed it.
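For anyone hitting the same issue, a quick sketch of that resampling step with torchaudio; the file paths are placeholders, and the 22.05 kHz target comes from the comment above (it should match the sampling rate in the model's config):

```python
import torchaudio

# Sketch: resample a 44 kHz clip down to the 22.05 kHz expected by the checkpoint.
wav, sr = torchaudio.load("clip_44k.wav")  # wav: (channels, samples)
wav_22k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=22050)
torchaudio.save("clip_22k.wav", wav_22k, 22050)
```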

CopyNinja1999 avatar Jun 13 '23 07:06 CopyNinja1999

btw, what is the romanizer you use for all the languages?

In case it's needed, our romanizer is this. Note that we only apply uromanization for ~5 languages with a large character vocabulary; otherwise, using raw characters achieves slightly better performance.
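If it helps, here is a hedged sketch of calling uroman from Python via subprocess. It assumes a local clone of the uroman repo and its bin/uroman.pl entry point, which reads text on stdin and writes romanized text to stdout; the path and the example string are placeholders:

```python
import subprocess

# Sketch: romanize text with uroman (Perl script: stdin in, romanized stdout out).
# UROMAN_PL points at a local clone of the uroman repo; adjust the path as needed.
UROMAN_PL = "uroman/bin/uroman.pl"

def uromanize(text: str) -> str:
    result = subprocess.run(
        ["perl", UROMAN_PL],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(uromanize("안녕하세요"))  # e.g. romanizing Korean text for a kor fine-tuning run
```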

chevalierNoir avatar Jun 13 '23 13:06 chevalierNoir

We now have a super simple fine-tuning script in Transformers: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#connectionist-temporal-classification-with-adapters

patrickvonplaten avatar Jun 15 '23 08:06 patrickvonplaten

We now have a super simple fine-tuning script in Transformers: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#connectionist-temporal-classification-with-adapters

how about TTS?

andergisomon avatar Jun 22 '23 22:06 andergisomon

Working on it cc @sanchit-gandhi

patrickvonplaten avatar Jun 27 '23 19:06 patrickvonplaten

Adding the models first to the library in https://github.com/huggingface/transformers/pull/24085, then will add training functionality in a second step 🤗

sanchit-gandhi avatar Jun 29 '23 12:06 sanchit-gandhi

Adding the models first to the library in huggingface/transformers#24085, then will add training functionality in a second step 🤗

Can't wait! 😀

qunash avatar Jun 29 '23 16:06 qunash

Hi @chevalierNoir @patrickvonplaten

For English, I see '1', '5', and '6' in the vocabulary list; what does each of them mean? Also, what's the difference between '–' and '_'?

FYI, English vocabulary: ['k', "'", 'z', 'y', 'u', 'd', 'h', 'e', 's', 'w', '–', '3', 'c', 'p', '-', '1', 'j', 'm', 'i', ' ', 'f', 'l', 'o', '0', 'b', 'r', 'a', '4', '2', 'n', '_', 'x', 'v', 't', 'q', '5', '6', 'g']

Thanks

taalua avatar Jun 29 '23 17:06 taalua

Any update on this?

kdcyberdude avatar Aug 08 '23 16:08 kdcyberdude

Please share an update on adding training functionality @sanchit-gandhi

Salama1429 avatar Sep 13 '23 14:09 Salama1429

any update on this? @sanchit-gandhi

arbianqx avatar Oct 05 '23 19:10 arbianqx

Just stopping by for an update. Could anyone please point me to a TTS fine-tuning codebase?

owos avatar Oct 16 '23 18:10 owos

I've released a repository that allows VITS/MMS fine-tuning with transformers compatibility: https://github.com/ylacombe/finetune-hf-vits. Feel free to check it out :hugs:
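For context, running such a checkpoint through transformers looks roughly like the sketch below; it assumes a transformers version that ships VitsModel, uses the public facebook/mms-tts-eng checkpoint as an example, and you would swap in the directory of your own fine-tuned model:

```python
import torch
from transformers import VitsModel, AutoTokenizer

# Sketch: inference with an MMS/VITS checkpoint from the Hub (or a local
# fine-tuned directory produced by a fine-tuning run).
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Hello, this is a fine-tuning test.", return_tensors="pt")
with torch.no_grad():
    # waveform: (batch, samples), sampled at model.config.sampling_rate
    waveform = model(**inputs).waveform
```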

ylacombe avatar Dec 14 '23 18:12 ylacombe