Add a documentation page for data quality required for fine-tuning

Open Aml-Hassan-Abd-El-hamid opened this issue 1 year ago • 10 comments

Self Checks

  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit your issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I'm trying to fine-tune the model so that it can pronounce the Egyptian dialect.

I currently have a number of long videos, each between 6 and 8 hours, that contain Egyptian books and the corresponding audio of different people reading them. I'm cutting the audio into segments at silences and matching the segments to the text of the books, but I'm missing some information needed to do this well, such as:

  1. How long should the ideal audio/text segments be to get the best results?
  2. Should I keep the audio in stereo or convert it to mono?
  3. Should I resample the audio or keep its original sample rate?
  4. Should I delete the audio segments with slight background music or keep them?
  5. Should I keep the punctuation in the text or delete it?
  6. Is there any cleaning of the text or the audio that should be done before fine-tuning?

2. Additional context or comments

No response

3. Can you help us with this feature?

  • [X] I am interested in contributing to this feature.

Aml-Hassan-Abd-El-hamid, Oct 06 '24 17:10

In a later version, we plan to remove the fine-tuning part. Instead, we'll add a series of tools to enhance the quality of your reference audio.

PoTaTo-Mika, Oct 07 '24 07:10

But what if I need to add a new language or a dialect that the model usually doesn't handle? We need to fine-tune the model to accomplish such a task, right?

Aml-Hassan-Abd-El-hamid, Oct 07 '24 09:10

True. If you want to fine-tune for a new language (though the next version will support most of the world's spoken languages), you may need about 2,000 hours of low-quality data and about 100 hours (the more, the better) of high-quality data (44.1 kHz with highly accurate labels). I hope this helps you with running the project.

PoTaTo-Mika, Oct 07 '24 10:10
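
For anyone checking a dataset against the figures above, here is a minimal sketch, not an official fish-speech tool. It assumes the clips are uncompressed WAV files that Python's standard `wave` module can read; `summarize_wavs` and `TARGET_RATE` are names made up for this illustration. It totals the audio duration in hours and lists files whose sample rate is not 44.1 kHz.

```python
import wave
from pathlib import Path
from typing import Dict, Iterable

# 44.1 kHz is the rate mentioned in the reply above for high-quality data.
TARGET_RATE = 44_100


def summarize_wavs(paths: Iterable[Path]) -> Dict[str, object]:
    """Return total duration in hours and the names of off-rate files.

    Hypothetical helper: it only reads WAV headers, so it says nothing
    about label accuracy or background noise.
    """
    total_seconds = 0.0
    off_rate = []
    for path in paths:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            # Duration = number of frames / frames per second.
            total_seconds += wav.getnframes() / rate
            if rate != TARGET_RATE:
                off_rate.append(path.name)
    return {"hours": total_seconds / 3600.0, "off_rate": off_rate}
```

Calling `summarize_wavs(sorted(Path("clips").glob("*.wav")))` on a folder of clips (the folder name is only an example) gives a quick view of how far a collection is from the roughly 100 hours of high-quality data suggested above.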

Thanks a lot for your response, that's really helpful. I have one last question: does the data need to be cut to a certain length? I have multiple long audio files, around 7 to 8 hours each. Should I cut them into shorter segments? If so, what is the recommended segment length: 15 minutes, 5 minutes, 30 seconds?

Aml-Hassan-Abd-El-hamid, Oct 07 '24 10:10

Yes, we recommend cutting them into 30-second segments.

PoTaTo-Mika, Oct 07 '24 11:10
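
One way to combine the 30-second recommendation with cutting on silence is sketched below. This is an illustration only, not the maintainers' method: `plan_segments` is a hypothetical helper, and it assumes you already have a list of pause times in seconds from whatever silence detector you use. It greedily cuts at the last pause that keeps each segment at or under 30 seconds, and falls back to a hard cut when no pause is available in that window.

```python
from typing import List, Tuple


def plan_segments(
    silence_midpoints: List[float],
    total_duration: float,
    max_len: float = 30.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds, each at most max_len long.

    silence_midpoints is an assumed input: candidate cut times taken from
    the middle of detected pauses.
    """
    candidates = sorted(t for t in silence_midpoints if 0.0 < t < total_duration)
    segments = []
    start = 0.0
    i = 0
    while total_duration - start > max_len:
        limit = start + max_len
        best = None
        # Walk forward to the last pause that still fits in this window.
        while i < len(candidates) and candidates[i] <= limit:
            if candidates[i] > start:
                best = candidates[i]
            i += 1
        # No pause in the window: fall back to a hard cut at the limit.
        cut = best if best is not None else limit
        segments.append((start, cut))
        start = cut
    # Whatever remains is already short enough to be the final segment.
    segments.append((start, total_duration))
    return segments
```

The greedy choice can leave a very short final segment, and a hard cut may split a word, so it is worth reviewing the planned boundaries before matching segments to the text.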

Thank you very much for your helpful and fast responses.

Aml-Hassan-Abd-El-hamid, Oct 07 '24 11:10

Thank you for your hard work on this project. I was wondering whether you could give a rough estimate of when the next model might be available. Even a ballpark estimate would be greatly appreciated.

GalenMarek14, Oct 25 '24 12:10

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Nov 25 '24 00:11

@PoTaTo-Mika

Do you have an ETA for the new version you mentioned? I'm very interested in Italian. I'm currently trying to fine-tune f5-TTS, but this is my first attempt at training a model and, above all, I don't have the resources ($$$) to train enough. I hope your new version includes Italian :)

MithrilMan, Dec 05 '24 21:12

The latest version (v1.5) supports Italian. You can find the weights here: https://huggingface.co/fishaudio/fish-speech-1.5

PoTaTo-Mika, Dec 09 '24 13:12

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Jan 09 '25 00:01

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Jan 25 '25 00:01