metavoice-src
Fine-tuning voice-cloning capability of metavoice
Hey team, can anyone help me understand the following about the metavoice model fine-tuning process? https://github.com/metavoiceio/metavoice-src/tree/main?tab=readme-ov-file#finetuning
- For fine-tuning the model, what are the minimum and maximum audio lengths the system allows?
- The fine-tuning script takes only 2 files as input: a speech (audio) file and its transcription. How is this possible? Is the SiSNR calculated against the same audio?
- I want to fine-tune the voice-cloning aspect of metavoice, if possible. Is there anything extra I need to implement to do this?
Hey @abhijeethp, sorry for only getting to this now. We've seen people finetune on training datasets made up of 5-10 s audio chunks, but that's not a hard range. We're not calculating SiSNR as part of finetuning. Are you asking whether using the same audio is appropriate?
Re finetuning the voice cloning: you should be all good if you follow the finetuning guide with a solid dataset, play around with the hyperparameters, and then use a good reference clip at inference.
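In case it helps with dataset prep, here's a rough sketch of how a longer recording could be cut into the 5-10 s chunks mentioned above. It only plans the cut points (start/end times in seconds) and doesn't touch any audio or use anything from metavoice-src. `plan_chunks` and its defaults are hypothetical names chosen for illustration, and the even-split strategy is an assumption, not something the finetuning guide prescribes.

```python
import math


def plan_chunks(total_s, min_s=5.0, max_s=10.0):
    """Plan equal-length (start, end) cut points, in seconds, for one recording.

    Hypothetical helper: the 5-10 s defaults come from the range mentioned in
    this thread, which is a rule of thumb rather than a hard limit.
    """
    # Even splitting only guarantees chunks >= min_s when min_s <= max_s / 2.
    if min_s <= 0 or min_s > max_s / 2:
        raise ValueError("expected 0 < min_s <= max_s / 2")
    if total_s < min_s:
        return []  # clip is too short to yield a usable chunk
    # Fewest chunks such that each one is no longer than max_s.
    n = math.ceil(total_s / max_s)
    bounds = [total_s * i / n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    for start, end in plan_chunks(23.0):
        print(f"{start:.2f}s -> {end:.2f}s")
```

You'd still need to do the actual slicing with your audio tool of choice and split the transcript to match each chunk. Cutting at silences rather than at fixed times will probably give cleaner training samples.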