
Fine-tuning voice-cloning capability of metavoice

Open abhijeethp opened this issue 1 year ago • 2 comments

Hey Team, Can anyone help me understand the following regarding the metavoice model fine-tuning process? https://github.com/metavoiceio/metavoice-src/tree/main?tab=readme-ov-file#finetuning

  • For fine-tuning the model, what are the minimum and maximum audio lengths the system allows?
  • The fine-tuning script takes only 2 files as input: a speech (audio) file and its transcription. How is this possible? Is the SiSNR calculated against the same audio?
  • I want to fine-tune the voice-cloning aspect of metavoice if possible. Is there anything extra I need to implement to do this?

abhijeethp avatar Apr 27 '24 01:04 abhijeethp

Old man voice

Arman12345677 avatar May 07 '24 21:05 Arman12345677

Hey @abhijeethp, sorry for only getting to this now. We've seen people finetuning with chunks of 5-10s audio in their training datasets, though that's not a hard range. We're not calculating SiSNR as part of finetuning. Are you asking whether using the same audio is appropriate?

Re finetuning the voice cloning: you should be all good if you follow the finetuning guide with a solid dataset, play around with the hyperparameters, and then use a good reference clip at inference.
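
To illustrate the 5-10s chunking mentioned above, here is a minimal sketch that computes evenly sized clip boundaries for a longer recording. The function name `chunk_spans` and its parameters are hypothetical and are not part of metavoice-src; the 5s and 10s defaults only mirror the range people have reported using, which is not a hard limit.

```python
import math
from typing import List, Tuple


def chunk_spans(
    total_s: float, min_s: float = 5.0, max_s: float = 10.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering a recording of total_s seconds.

    Hypothetical helper, not part of metavoice-src. The recording is divided
    into the smallest number of equal-length chunks such that no chunk is
    longer than max_s. Recordings shorter than min_s yield no chunks.
    """
    if total_s < min_s:
        return []
    n = math.ceil(total_s / max_s)  # fewest chunks that respect max_s
    length = total_s / n            # equal length, so no short leftover tail
    if length < min_s:              # only possible when min_s > max_s / 2
        return []
    return [(i * length, (i + 1) * length) for i in range(n)]


if __name__ == "__main__":
    for start, end in chunk_spans(25.0):
        print(f"{start:.2f}s -> {end:.2f}s")
```

The spans could then be passed to whatever audio library you use to cut the clips. Each clip would still need its own matching transcription, and cutting at fixed times can split words, so cutting at pauses is likely a better choice in practice.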

lucapericlp avatar May 14 '24 21:05 lucapericlp