tortoise-tts

read.py Always changes voice for second part

Open n8bot opened this issue 1 year ago • 13 comments

When a custom voice is used and the prompt contains a period, the part of the read.py output after the period is spoken in a different voice.

When the periods and commas are all replaced with |, the entire prompt is read flawlessly.

My dear fellow Martians, I know there are rumors going around that I'm not actually on my way to Mars, and I want to put those rumors to rest. I can assure you that those rumors are simply false. I am currently in the cockpit of the spaceship, hurtling towards our destination at unimaginable speeds. We have encountered a few minor setbacks, but rest assured, we are still on course and will arrive on schedule. I understand that there may be doubts and uncertainties, but I promise you that my commitment to our mission is unwavering. We will establish a new home on Mars, and I will be there with you every step of the way. Thank you for your understanding and your support, and I look forward to seeing you all soon.

vs

My dear fellow Martians | I know there are rumors going around that I'm not actually on my way to Mars | and I want to put those rumors to rest | I can assure you that those rumors are simply false | I am currently in the cockpit of the spaceship | hurtling towards our destination at unimaginable speeds | We have encountered a few minor setbacks | but rest assured | we are still on course and will arrive on schedule | I understand that there may be doubts and uncertainties | but I promise you that my commitment to our mission is unwavering | We will establish a new home on Mars | and I will be there with you every step of the way | Thank you for your understanding and your support | and I look forward to seeing you all soon |

Except that the trailing | causes the clip to end in a weird groan lmao.

periodsandcommas.zip seperatorsonly.zip

n8bot, Mar 14 '23 23:03

I can't confirm that a pipe changes the output, but I am seeing a random voice take over for a sentence or two and then switch back to the custom voice it was originally using.

Angrod, Mar 15 '23 18:03

This prompt was improved further by removing the extraneous spaces in between the sentence and the pipe separator. Also, putting a period at the end instead of a separator helped the ending.

My dear fellow Martians, I know there are rumors going around that I'm not actually on my way to Mars, and I want to put those rumors to rest| I can assure you that those rumors are simply false| I am currently in the cockpit of the spaceship, hurtling towards our destination at unimaginable speeds| We have encountered a few minor setbacks, but rest assured, we are still on course and will arrive on schedule| I understand that there may be doubts and uncertainties, but I promise you that my commitment to our mission is unwavering| We will establish a new home on Mars, and I will be there with you every step of the way| Thank you for your understanding and your support, and I look forward to seeing you all soon.

pipecommaandONEperiodattheendonly.zip
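
The cleanup described above (no space before each pipe, a period instead of a trailing separator) can be automated. The following is a minimal sketch, not part of tortoise-tts; `tidy_pipe_prompt` is a hypothetical helper name, and whether this formatting actually helps appears to depend on the setup.

```python
def tidy_pipe_prompt(prompt: str) -> str:
    """Normalize a '|'-separated prompt (hypothetical helper, not a tortoise-tts API)."""
    # Split on the separator and trim the whitespace around every segment.
    segments = [seg.strip() for seg in prompt.split("|")]
    # Drop empty segments, e.g. the one produced by a trailing '|'.
    segments = [seg for seg in segments if seg]
    if not segments:
        return ""
    # Rejoin with no space before each pipe and one space after it.
    tidied = "| ".join(segments)
    # End on a period unless the prompt already ends with sentence punctuation.
    if tidied[-1] not in ".!?":
        tidied += "."
    return tidied


print(tidy_pipe_prompt("We will establish a new home on Mars | and I will be there | "))
```

The trailing separator is dropped rather than kept because it was reported to make the clip end badly.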

n8bot, Mar 15 '23 19:03

Thanks, I'll give it another try. It will make long-form text a heavier lift, but the end results will still be worth it.

Angrod, Mar 15 '23 22:03

You replaced commas with | as well? How does it know to read a sentence with a pause (for commas), as opposed to just ending the sentence there like it would for a period?

Also, does this method help keep the speed and pacing more consistent? I'm noticing that some sentences are spoken faster or slower, which is annoying.

embanot, Mar 15 '23 23:03

Replacing commas was a mistake; it was only a troubleshooting step. Commas are generally good. So is an "em-dash", which is alt+0151: —

I am going to experiment with having no spaces around the pipe separators. Spaces can really ruin things.

n8bot, Mar 16 '23 01:03

This is very interesting. I am going to experiment with this tonight. Thank you for the tip and let me know if you find any more improvements.

embanot, Mar 16 '23 02:03

I tested it out and it does seem to produce more consistent voices. But one thing I noticed is that it tends to cut off the end of the last word of a sentence. Is there a way around that?

embanot, Mar 16 '23 17:03

Putting a period at only the end of the prompt can sometimes help with the end getting cut off. I've found that using do_tts.py and doing multiple candidates at once sometimes provides at least one version without the end cut off. The results can be edited together and fixed with audio software.

Play around with spaces, periods, etc. That seems to influence the errors a lot.

Oh, and a note about do_tts.py: I don't think it supports the | separator character, so just do one chunk at a time.
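
Running do_tts.py one chunk at a time with several candidates could be scripted roughly as follows. This is a sketch under stated assumptions: the script path `tortoise/do_tts.py` and the `--text` flag are assumed, the other flags are assumed to match those in the read.py command shown later in this thread, and `build_do_tts_commands` and `my_voice` are hypothetical names. The commands are only built and printed, not executed.

```python
from typing import List


def build_do_tts_commands(
    prompt: str,
    voice: str,
    candidates: int = 3,
    preset: str = "high_quality",
    output_path: str = "results",
) -> List[List[str]]:
    """Build one do_tts.py command per '|'-separated chunk (hypothetical helper)."""
    commands: List[List[str]] = []
    for chunk in prompt.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty chunks such as the one after a trailing '|'
        # One output folder per chunk, so candidates from different chunks stay separate.
        chunk_dir = f"{output_path.rstrip('/')}/chunk_{len(commands):02d}"
        commands.append([
            "python", "tortoise/do_tts.py",   # assumed script location
            "--text", chunk,                   # assumed flag name
            "--voice", voice,
            "--preset", preset,
            "--candidates", str(candidates),
            "--output_path", chunk_dir,
        ])
    return commands


for cmd in build_do_tts_commands("Hello this is sentence one.| This is sentence two.", "my_voice"):
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run` once the flags have been checked against the installed version; the best candidate per chunk can then be picked by ear and joined in audio software.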

n8bot, Mar 16 '23 18:03

The em-dash (—) is proving problematic at times, too. But other times it works well. Experimentation is needed.

n8bot, Mar 16 '23 21:03

It seems that separating the sentences like "Hello this is sentence one.| This is sentence two.| This is the last sentence." works very well.

Edit: The format "Hello this is sentence one.|This is sentence two.|This is the last sentence." works well too.
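
Converting ordinary text into that second format can be sketched as below. This is an illustration only: `pipe_join_sentences` is a hypothetical helper, and the regular expression is a naive sentence splitter that will also split after abbreviations such as "e.g.".

```python
import re


def pipe_join_sentences(text: str) -> str:
    """Keep sentence punctuation and join sentences with '|' (hypothetical helper)."""
    # Split wherever '.', '!' or '?' is followed by whitespace; the punctuation stays
    # attached to the sentence because the lookbehind does not consume it.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "|".join(s for s in sentences if s)


print(pipe_join_sentences(
    "Hello this is sentence one. This is sentence two. This is the last sentence."
))
```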

n8bot, Mar 17 '23 00:03

Hmm, interesting. I will experiment with that, thanks!

embanot, Mar 17 '23 03:03

I tried Tortoise in Google Colab, with 15 GB of VRAM, and I did not have any problem whatsoever with periods causing second voices to appear. It seems that the 8 GB of VRAM on my local GPU could be contributing to the errant voices appearing.

n8bot, Mar 21 '23 00:03

I'm facing the same issue with some of the provided voices, but not necessarily on the second part. It's just that from time to time, a sentence or a fragment of a sentence somewhere in the text will be read with a different voice.

Example with this text.txt:

What makes jee pee tee-3+ so advanced and powerful? Transformer based Large Language Models like Open AI’s jee pee tee have rapidly advanced in quality and capability. Transformers use an attention weight for each word in the input text, regardless of its position, to better consider long-term dependencies in the text and improve understanding and generation of natural language. They also allow for parallel processing which makes them faster to train and less computationally expensive, enabling the use of larger models with more accurate results. jee pee tee-3 and other similar models have a huge number of parameters (jee pee tee-3 has 175 billion params) giving them significant learning capacity. They are trained on all the text content from the public internet (jee pee tee-3 used 570 gigabytes of compressed text) to predict the probability distribution of the next word given prior words in the text. This simple objective can be performed on raw text without human labeling. When done at scale, it is surprising effective at teaching models to "understand" and generate natural language text.

I used the command:

/users/franck/workspace/tts//tortoise-tts$ python tortoise/read.py --textfile text.txt --voice freeman,train_empire --preset high_quality --candidates 1 --output_path results/longtexts

The audio file with the train_empire voice was fine (train_empire.zip), but the audio file with the freeman voice had 1 fragment of a sentence with a female voice ("jee pee tee-3 has 175 billion params) giving them significant learning capacity"): selected segment where the voice is messed up.zip (for the entire speech: full speech.zip).

Franck-Dernoncourt, Apr 07 '23 15:04

Quoting n8bot: "This prompt was improved further by removing the extraneous spaces in between the sentence and the pipe separator. Also, putting a period at the end instead of a separator helped the ending."

I find that, for the most part, it switches voices where the input is split into separate lines, which can happen in multiple places in the code. Under the hood, removing periods and putting in "|" or the em-dash prevents splitting, unless the text is too long and a split is forced. Those characters also run a special token, [UNK], through the model, which I've found can often work in place of periods. However, this is not ideal, because [UNK] slows down performance and sometimes produces undesirable results.

I don't know if there is a good solution without rewriting quite a bit of the splitting code, which I am not sure I want to do.

Edit: Don't use "|" with read.py. It is used to split the lines in that script.

Edit 2: Maybe "|" produces good results by splitting the text, but if so I can't figure out why. Maybe resetting the context encourages the model to use the most common/predictable/general voice from the input clips and/or trained weights.
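
The splitting behavior described here can be pictured with the simplified sketch below. It is not the actual read.py code: `split_for_reading` is a hypothetical name, the 300-character limit is an assumed value, and the real splitter handles more cases. It only shows the two paths mentioned above: an explicit "|" decides the split points, otherwise sentences are packed into chunks and a split is forced when the text gets too long.

```python
import re
from typing import List


def split_for_reading(text: str, max_length: int = 300) -> List[str]:
    """Simplified illustration of line splitting (hypothetical, not the read.py code)."""
    # Path 1: an explicit '|' in the text is treated as the split cue.
    if "|" in text:
        return [part.strip() for part in text.split("|") if part.strip()]

    # Path 2: no '|', so pack whole sentences into chunks of at most max_length
    # characters (a single sentence longer than the limit becomes its own chunk).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_length:
            chunks.append(current)  # the forced split happens here
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_for_reading("Sentence one.|Sentence two."))
print(split_for_reading("Aaaa. Bbbb. Cccc.", max_length=11))
```

Each returned chunk would be generated separately, which is where a voice change between chunks could creep in.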

spottenn, Jun 23 '23 21:06