tortoise-tts

44100 24000 22050 Hz sampling rate frequency limitations - why? what are the consequences?

Open ghostlyghastly opened this issue 2 years ago • 3 comments

Hi, tortoise is amazing!

Fun fact: it is not good at imitating Duke Nukem, even with the high_quality preset, as probably expected 😄

I was wondering why 22050 Hz voice input is required, what would happen if 44100 Hz input were provided, and why the output is 24000 Hz rather than 22050 Hz.

I'm guessing 22050 was used to make training faster.

I tried a voice sample with a 44.1kHz sample rate and the output sounded fine to me. Is using different rates safe after all? I didn't check how this affected speed.

ghostlyghastly, Feb 20 '23 16:02

Bump!! I was wondering the exact same thing and am hoping for an answer from someone smart!

oganesso, Mar 01 '23 15:03

The difference of 22kHz and 24kHz is due to a change I made mid-training that cannot be fixed without retraining the whole stack.

If you provide a 44kHz input I believe it will get automatically resampled to 22kHz. If it doesn't, you'll get a voice that sounds like a chipmunk.
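To make the resampling step concrete, here is a minimal sketch in plain Python. It is not taken from the tortoise codebase: `resample_linear` and `duration_seconds` are hypothetical helpers written for illustration, and linear interpolation stands in for whatever resampler the library actually uses. A production resampler would also apply an anti-aliasing low-pass filter before reducing the rate.

```python
import math


def resample_linear(samples, orig_rate, target_rate):
    """Resample a mono signal by linear interpolation (illustration only).

    Hypothetical helper: no anti-aliasing filter is applied, so content
    above target_rate / 2 would alias when downsampling real recordings.
    """
    if orig_rate == target_rate:
        return list(samples)
    n_out = int(round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def duration_seconds(n_samples, assumed_rate):
    """Duration of a clip when its samples are interpreted at assumed_rate."""
    return n_samples / assumed_rate


# One second of a 220 Hz tone sampled at 44.1 kHz.
tone = [math.sin(2 * math.pi * 220 * n / 44100) for n in range(44100)]

# Properly resampled: half as many samples, still one second at 22.05 kHz.
resampled = resample_linear(tone, 44100, 22050)
print(len(resampled), duration_seconds(len(resampled), 22050))

# Not resampled: the same 44100 samples read as if they were 22.05 kHz audio.
print(duration_seconds(len(tone), 22050))
```

The point of the comparison is that resampling changes the number of samples so the clip keeps its real duration, whereas feeding samples in at the wrong assumed rate leaves the sample count alone and shifts the apparent speed and pitch by the ratio of the two rates.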

neonbjb, Mar 01 '23 17:03

> The difference of 22kHz and 24kHz is due to a change I made mid-training that cannot be fixed without retraining the whole stack.
>
> If you provide a 44kHz input I believe it will get automatically resampled to 22kHz. If it doesn't, you'll get a voice that sounds like a chipmunk.

Can you elaborate on the reason you chose 22kHz mid-training? Also, is it possible to quantize the model to 4-bit for faster inference? By the way, I really love tortoise; it has insane potential. I did some work on it by making a TTS extension for booga ui, but there's a lot of work left to catch up with the current version of transformers. My friend is currently working on it; hopefully one day we can do "from_pretrained" :D

I had a couple more questions that I'm really curious about, if you don't mind: 1) Any chance you know what the effect on the output would be of using a larger GPT-2 or even GPT-3 model inside the autoregressive one? 2) Why is the tokenizer the way it is? It's really small and contains words like "is", "you", "they", etc. Why these choices, and why this tokenizer?

Again, thank you so much for your work, tortoise is really awesome! I believe it can change the whole TTS world. I really have no idea how you came up with this stuff; it seems like sorcery making the models do what they do...

SicariusSicariiStuff, Feb 07 '24 23:02