tortoise-tts
read.py Always changes voice for second part
When a custom voice is used and a period appears in the text, the second part of the read.py output is read in a different voice.
When the periods and commas are all replaced with |, then the entire prompt is flawless.
My dear fellow Martians, I know there are rumors going around that I'm not actually on my way to Mars, and I want to put those rumors to rest. I can assure you that those rumors are simply false. I am currently in the cockpit of the spaceship, hurtling towards our destination at unimaginable speeds. We have encountered a few minor setbacks, but rest assured, we are still on course and will arrive on schedule. I understand that there may be doubts and uncertainties, but I promise you that my commitment to our mission is unwavering. We will establish a new home on Mars, and I will be there with you every step of the way. Thank you for your understanding and your support, and I look forward to seeing you all soon.
vs
My dear fellow Martians | I know there are rumors going around that I'm not actually on my way to Mars | and I want to put those rumors to rest | I can assure you that those rumors are simply false | I am currently in the cockpit of the spaceship | hurtling towards our destination at unimaginable speeds | We have encountered a few minor setbacks | but rest assured | we are still on course and will arrive on schedule | I understand that there may be doubts and uncertainties | but I promise you that my commitment to our mission is unwavering | We will establish a new home on Mars | and I will be there with you every step of the way | Thank you for your understanding and your support | and I look forward to seeing you all soon |
Except the last trailing | causes the clip to end in a weird groan lmao.
Can't confirm that a pipe changes the output, but I am seeing a random voice take over for a sentence or two and then switch back to the custom voice it was originally using.
This prompt was improved further by removing the extraneous spaces between the sentences and the pipe separators. Also, putting a period at the end instead of a separator helped the ending.
My dear fellow Martians, I know there are rumors going around that I'm not actually on my way to Mars, and I want to put those rumors to rest| I can assure you that those rumors are simply false| I am currently in the cockpit of the spaceship, hurtling towards our destination at unimaginable speeds| We have encountered a few minor setbacks, but rest assured, we are still on course and will arrive on schedule| I understand that there may be doubts and uncertainties, but I promise you that my commitment to our mission is unwavering| We will establish a new home on Mars, and I will be there with you every step of the way| Thank you for your understanding and your support, and I look forward to seeing you all soon.
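For anyone who wants to automate this, here is a rough Python sketch of the preprocessing described above (the pipe_separate helper is my own name, not part of tortoise-tts): sentence-ending periods become pipes with no surrounding spaces, commas are left alone, and only the final period is kept.

import re

def pipe_separate(prompt: str) -> str:
    # Split on periods that end a sentence (a period followed by whitespace);
    # commas and other mid-sentence punctuation are left untouched.
    sentences = [s.strip() for s in re.split(r'\.\s+', prompt.strip()) if s.strip()]
    # Re-join with '|' (no surrounding spaces) and keep a single period at the very end.
    return '|'.join(sentences).rstrip('.') + '.'

text = ("My dear fellow Martians, I know there are rumors going around. "
        "I can assure you that those rumors are simply false.")
print(pipe_separate(text))
# -> My dear fellow Martians, I know there are rumors going around|I can assure you that those rumors are simply false.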
Thanks, I'll give it another try. It will make longform a heavier lift, but the end results will still be worth it.
You replaced commas with | as well? How does it know to read a sentence with a pause break (for commas), as opposed to just ending the sentence there like it would for a period?
Also, does this method help keep the speed and pacing more consistent as well? I'm noticing that some sentences are spoken faster or slower, which is annoying.
Replacing commas was a mistake and only a troubleshooting step. Commas are generally good. So is an em-dash, which is Alt+0151: —
I am going to experiment with having no spaces around the pipe separators. Spaces can really ruin things.
This is very interesting. I am going to experiment with this tonight. Thank you for the tip and let me know if you find any more improvements.
I tested it out and it does seem to produce more consistent voices. But one thing I noticed is that it tends to cut off the ends of the last words of a sentence. Is there a way around that?
Putting a period at only the end of the prompt can sometimes help with the end getting cut off. I've found that using do_tts.py and doing multiple candidates at once sometimes provides at least one version without the end cut off. The results can be edited together and fixed with audio software.
Play around with spaces, periods, etc. That seems to influence the errors a lot.
Oh, and a note about do_tts.py: I don't think it supports the | character separator, so just do one chunk at a time.
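If I remember the flags correctly, running a single chunk with several candidates looks something like this (custom_voice is just a placeholder for whatever voice folder you are using, and I believe do_tts.py takes a --candidates flag like read.py does):

python tortoise/do_tts.py --text "We have encountered a few minor setbacks, but rest assured, we are still on course and will arrive on schedule." --voice custom_voice --preset high_quality --candidates 3 --output_path results/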
The em-dash (—) is proving problematic at times, too. But other times it works well. Experimentation is needed.
It seems that separating the sentences like "Hello this is sentence one.| This is sentence two.| This is the last sentence." works very well.
Edit: The format "Hello this is sentence one.|This is sentence two.|This is the last sentence." (no spaces after the pipes) works well too.
Hmm, interesting. I will experiment with that, thanks!
I tried Tortoise in Google Colab, with 15 GB of VRAM, and I did not seem to have any problem whatsoever with periods causing second voices to appear. It seems that the low 8 GB of VRAM in my local GPU could be contributing to the errant voices appearing.
I'm facing the same issue with some of the provided voices, but not necessarily on the second part. It's just that from time to time, a sentence or a fragment of a sentence somewhere in the text will be read with a different voice.
Example with this text.txt:
What makes jee pee tee-3+ so advanced and powerful? Transformer based Large Language Models like Open AI’s jee pee tee have rapidly advanced in quality and capability. Transformers use an attention weight for each word in the input text, regardless of its position, to better consider long-term dependencies in the text and improve understanding and generation of natural language. They also allow for parallel processing which makes them faster to train and less computationally expensive, enabling the use of larger models with more accurate results. jee pee tee-3 and other similar models have a huge number of parameters (jee pee tee-3 has 175 billion params) giving them significant learning capacity. They are trained on all the text content from the public internet (jee pee tee-3 used 570 gigabytes of compressed text) to predict the probability distribution of the next word given prior words in the text. This simple objective can be performed on raw text without human labeling. When done at scale, it is surprising effective at teaching models to "understand" and generate natural language text.
I used the command:
/users/franck/workspace/tts//tortoise-tts$ python tortoise/read.py --textfile text.txt --voice freeman,train_empire --preset high_quality --candidates 1 --output_path results/longtexts
The audio file with the train_empire voice was fine (train_empire.zip), but the audio file with the freeman voice had one fragment of a sentence read in a female voice ("jee pee tee-3 has 175 billion params) giving them significant learning capacity"): selected segment where the voice is messed up.zip (for the entire speech: full speech.zip).
I find that, for the most part, it switches voices when the input is split into separate lines, which can happen in multiple places in the code. Under the hood, what you are doing by removing periods and putting in "|" or the em-dash is preventing that splitting, unless it is forced to split because the text is too long. Putting in those characters also runs a special token, [UNK], through the model, which I've found can often work in place of periods. However, this is not ideal, because [UNK] slows down performance and sometimes produces undesirable results.
I don't know if there is a good solution without rewriting quite a bit of the splitting code, which I am not sure I want to do.
Edit: Don't use "|" with read.py. It is used to split the lines in that script.
Edit 2: Maybe "|" produces good results by splitting the text, but if so I can't figure out why. Maybe resetting the context encourages the model to use the most common/predictable/general voice from the input clips and/or trained weights.
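For reference, here is roughly how I understand the splitting to work, written out as a sketch (a paraphrase of the behavior, not the exact read.py code; split_and_recombine_text is the helper in tortoise/utils/text.py that does the normal sentence-based chunking):

from tortoise.utils.text import split_and_recombine_text

def split_text(text):
    if '|' in text:
        # read.py treats '|' as an explicit cue for where to split,
        # so every pipe starts a new chunk.
        return [chunk.strip() for chunk in text.split('|') if chunk.strip()]
    # Otherwise the text is split on sentence boundaries (periods, etc.)
    # and recombined into chunks short enough for the model.
    return split_and_recombine_text(text)

print(split_text("Hello this is sentence one.|This is sentence two.|This is the last sentence."))
# -> ['Hello this is sentence one.', 'This is sentence two.', 'This is the last sentence.']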