
How to customize for another language

Open rohanjhanepal opened this issue 11 months ago • 12 comments

I am not able to customize it. How can I customize it for another language?

rohanjhanepal avatar Aug 12 '23 04:08 rohanjhanepal

Hey, did you figure this out? I want to customise it for Hindi.

Mayank-Sharma-27 avatar Aug 26 '23 01:08 Mayank-Sharma-27

I'm also trying to find out how to customize it for other languages but can't find the documentation. Can someone help me, please?

ctimict avatar Sep 24 '23 07:09 ctimict

Adding a new voice

To add new voices to Tortoise, you will need to do the following:

1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks, or podcasts. Guidelines for good clips are in the next section.
2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
3. Save the clips as WAV files with floating point format and a 22,050 Hz sample rate.
4. Create a subdirectory in voices/.
5. Put your clips in that subdirectory.
6. Run tortoise utilities with --voice=<name of your subdirectory>.
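The required clip format ("WAV file with floating point format and a 22,050 sample rate") is usually produced with ffmpeg or a library such as soundfile; purely as a dependency-free illustration, here is a sketch that writes a mono IEEE-float WAV using only the Python standard library. The helper name is mine, not part of Tortoise.

```python
import struct

def write_float_wav(path, samples, sample_rate=22050):
    """Write a mono 32-bit IEEE-float WAV file (format tag 3).

    Note: strictly conformant non-PCM WAVs also carry a cbSize field and
    a 'fact' chunk; most readers accept this minimal 16-byte fmt chunk.
    """
    data = struct.pack("<%df" % len(samples), *samples)
    byte_rate = sample_rate * 4            # 4 bytes per float32 sample
    with open(path, "wb") as f:
        f.write(b"RIFF" + struct.pack("<I", 36 + len(data)) + b"WAVE")
        # fmt chunk: size 16, format 3 (IEEE float), 1 channel, 32 bits
        f.write(b"fmt " + struct.pack("<IHHIIHH", 16, 3, 1,
                                      sample_rate, byte_rate, 4, 32))
        f.write(b"data" + struct.pack("<I", len(data)) + data)
```

In practice a one-line ffmpeg invocation (resampling to 22,050 Hz, float output) does the same job for existing recordings; the sketch just makes the target byte layout explicit.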

Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Here are some tips for picking good clips:

- Avoid clips with background music, noise or reverb. These clips were removed from the training dataset, so Tortoise is unlikely to do well with them.
- Avoid speeches. These generally have distortion caused by the amplification system.
- Avoid clips from phone calls.
- Avoid clips with excessive stuttering, stammering or filler words like "uh" or "like".
- Try to find clips spoken in the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
- The text being spoken in the clips does not matter, but diverse text does seem to perform better.

merolaika avatar Nov 03 '23 10:11 merolaika

Just gather a lot of data in the target language (something like 10k hours), train your own BPE tokenizer, and fine-tune the autoregressive model with DL-Art-School.

manmay-nakhashi avatar Nov 03 '23 11:11 manmay-nakhashi
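The "train your own BPE tokenizer" step is normally done with a library such as Hugging Face `tokenizers`; as a self-contained illustration of what that training does, here is a minimal pure-Python sketch of the classic BPE merge-learning loop (count adjacent symbol pairs, merge the most frequent, repeat). The function name and defaults are illustrative, not Tortoise's actual training code.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from an iterable of words.

    Each word starts as a tuple of characters; on every round the most
    frequent adjacent pair is recorded and fused throughout the vocab.
    """
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break                      # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            fused, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    fused.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    fused.append(symbols[i])
                    i += 1
            new_vocab[tuple(fused)] += freq
        vocab = new_vocab
    return merges
```

For a real run on thousands of hours of transcripts you would use `tokenizers.trainers.BpeTrainer` (Tortoise's stock tokenizer has a 256-entry vocabulary, as noted later in this thread), but the learned "merges" list is the same kind of object this sketch produces.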

Just gather a lot of data in the target language (something like 10k hours), train your own BPE tokenizer, and fine-tune the autoregressive model with DL-Art-School.

Have you done it? Is there any YouTube tutorial on this?

Fizikaz avatar Nov 03 '23 15:11 Fizikaz

@manmay-nakhashi Quick question: every tokenizer I train results in gibberish after training with DLAS. I am trying to create a tokenizer that works; any tips? The only tokenizer that works for me is https://huggingface.co/AOLCDROM/Tortoise-TTS-de/tree/main, but it is not good enough.

aklacar1 avatar Nov 20 '23 18:11 aklacar1

@manmay-nakhashi What would be needed if we want to use a larger tokenizer with DLAS and Tortoise, e.g. 512 tokens?

aklacar1 avatar Nov 20 '23 18:11 aklacar1

@aklacar1 You might need to modify a few things to support a larger tokenizer, but keeping the tokenizer at 256 will work out of the box if you have enough data (~10k hours).

manmay-nakhashi avatar Nov 21 '23 02:11 manmay-nakhashi
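Growing the tokenizer beyond 256 means the autoregressive model's text-embedding table (and the matching output projection) no longer match the pretrained checkpoint, so the "few things to modify" largely amount to resizing those weight matrices: keep the pretrained rows and initialize the new ones. A framework-free sketch, with plain lists standing in for tensors and `init_scale` an arbitrary choice of mine, not something the thread specifies:

```python
import random

def resize_embedding(weights, new_vocab_size, init_scale=0.02):
    """Grow a (vocab, dim) embedding table to new_vocab_size rows.

    Old rows are copied verbatim so pretrained knowledge survives;
    new rows get small Gaussian initial values.
    """
    dim = len(weights[0])
    rng = random.Random(0)                 # fixed seed for reproducibility
    grown = [row[:] for row in weights]    # deep-copy existing rows
    while len(grown) < new_vocab_size:
        grown.append([rng.gauss(0.0, init_scale) for _ in range(dim)])
    return grown
```

With PyTorch the same idea is a copy into a freshly allocated `nn.Embedding` before fine-tuning; the key point is that only the vocabulary-sized dimensions change, everything else in the checkpoint loads as-is.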

Just gather a lot of data in the target language (something like 10k hours), train your own BPE tokenizer, and fine-tune the autoregressive model with DL-Art-School.

How many speakers should the dataset strive for? Would 1k hours of data be enough? Also, it is not clear how the splitting should be done: split by sentences, by 10-second chunks, or by silence?

Fizikaz avatar Nov 27 '23 15:11 Fizikaz
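The thread never gets an official answer on splitting; one common heuristic is to cut at silent stretches while capping segments near the 10-second mark mentioned earlier. A sketch over raw mono samples, where the `threshold` and timing values are arbitrary assumptions rather than anything Tortoise prescribes:

```python
def split_on_silence(samples, sample_rate=22050, threshold=0.01,
                     min_silence_s=0.3, max_seg_s=10.0):
    """Split a mono float signal at silent stretches, capping segment length.

    A cut happens when we've seen at least min_silence_s of low-amplitude
    samples, or when a segment would exceed max_seg_s regardless.
    """
    min_sil = int(min_silence_s * sample_rate)
    max_seg = int(max_seg_s * sample_rate)
    segments, start, quiet = [], 0, 0
    for i, s in enumerate(samples):
        quiet = quiet + 1 if abs(s) < threshold else 0
        long_silence = quiet >= min_sil and i - start > min_sil
        if long_silence or i - start + 1 >= max_seg:
            segments.append(samples[start:i + 1])
            start, quiet = i + 1, 0
    if start < len(samples):
        segments.append(samples[start:])   # trailing remainder
    return segments
```

Real pipelines usually do this with an energy- or VAD-based tool (e.g. librosa or pydub's silence utilities) and then filter out segments that are too short; the sketch just shows the cut-point logic.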

Hello, can I ask how to create a tokenizer file for Japanese? Japanese uses Kanji in sentences and words. I found a simple tokenizer file containing hiragana and katakana, which I think I can use, but it has few "merges" entries and few Kanji. File link: https://git.ecker.tech/mrq/ai-voice-cloning/src/branch/master/models/tokenizers/japanese.json

GuenKainto avatar Dec 28 '23 09:12 GuenKainto

@aklacar1 You might need to modify a few things to support a larger tokenizer, but keeping the tokenizer at 256 will work out of the box if you have enough data (~10k hours).

Hi, I tried to create a tokenizer for Japanese, but it has a vocab_size of 3000 (from 7,696 lines of text). What should I modify for training? Thank you.

GuenKainto avatar Dec 29 '23 08:12 GuenKainto

Can anyone share output for Hindi cloned audio?

super-animo avatar Apr 11 '24 08:04 super-animo