
TTS: Coqui Changing Accents on its own

Open kremesch opened this issue 6 months ago • 5 comments

Not sure if this is a Coqui or Speech Note issue.

When using Coqui XTTS v2.0.2 /en, it will change accents from US to UK to Southern in a single paragraph. It doesn't do it all the time. It is random and inconsistent in its behaviour, and it's quite jarring to listen to. Changing a punctuation mark to force it to re-read the problem sentence sometimes fixes it. Other times, simply restarting the app will work, but then it has to process each sentence again. This can be time consuming when prepping a large document for audio studies. Being able to read selected text would be enough to reduce the latter.

I don't know if it's possible, but it seems a solution would be to force a language model on it when using it so it doesn't make assumptions on what part of the world it's from. In my past Windows experience, this was generally controlled by the app hosting the voices. If this is something I should be reporting to Coqui, please let me know.

It's funny to listen to, honestly. It sounds like my TTS has some Personality issues.

kremesch avatar Jun 29 '25 15:06 kremesch

It's pretty funny, I have to say :) I also understand that it can be quite annoying too. I like XTTS a lot, but like you, when working on texts I had to add extra ,. to force a certain pronunciation or just fix the error. I don't think this problem can be fixed in general. The XTTS model is fixed and only retraining could possibly improve it. The company behind XTTS - Coqui - has closed, so unfortunately they will not help.

The obvious solution would be to use a model that has been trained on single individual speakers like the Piper voices (Speech Note has hundreds of them) but you would not be able to use voice-cloning capability.

In version v4.8.0, I added a new F5-TTS model especially for voice cloning. It supports English and Chinese. Have you tried it out? Maybe you can get better results. F5-TTS is super slow on CPU, so an NVIDIA or AMD card is needed to use it.

mkiol avatar Jul 01 '25 17:07 mkiol

I did try F5-TTS, but the speech it generates is garbled. I'm not sure if I'm doing something wrong, or if there is something I have to do. It's almost like it's speaking a different language. I'm not sure what's going on with it. I tried it on my laptop just to see if it was my system, but it did the same thing. Installing the AMD add-on seemed to break things more--no speech was generated from either Coqui or F5.

Is there something different I need to do to clone a voice for it? I have various samples that are short and long.

I've been creating wav files and importing them with the text pasted into the text field. It works great with Coqui, but something seems broken when I use the same sample with F5. I haven't played too much with F5 since Coqui works well enough. I plan to revisit it with each update though.

TTS is a bit of passion-hobby for me since I've learned to use it decades ago to help with my dyslexia. The humanized voices have been a game-changer. Having this app and these voices for Linux is literally the best thing since sliced bread for me :)

EDIT: never mind what I said about F5 not working. I did some experimenting after this and found the sweet spot for the length of wav files. In my case, 10 seconds is a perfect balance for the speech engine. When I trimmed them to 10 seconds, the tempo and speed of their speech was perfect. Anything too long turns into someone talking too fast, or an alien screaming at you. This gives me some ideas on the tempo of the speaker. I just wish I could use the Add-on. My card is an RX 7800 XT. I thought it would be supported, but F5 throws me an error if I install it.
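For anyone who wants to script the trimming instead of doing it in an audio editor, here is a minimal sketch using only Python's standard `wave` module. The function name `trim_wav` and the file names are hypothetical, it assumes the sample is an uncompressed PCM WAV file, and the 10-second default simply reflects what worked in the case described above, not a documented F5 limit.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=10.0):
    """Copy at most max_seconds of audio from src_path to dst_path.

    Returns the duration (in seconds) of the file that was written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Number of frames that fit into the requested duration.
        max_frames = int(max_seconds * src.getframerate())
        frames = src.readframes(min(max_frames, src.getnframes()))

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    bytes_per_second = params.sampwidth * params.nchannels * params.framerate
    return len(frames) / bytes_per_second
```

Calling `trim_wav("sample.wav", "sample_10s.wav")` would keep the first 10 seconds only. It cuts from the start of the file, so it does nothing about long pauses inside the sample.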

kremesch avatar Jul 01 '25 22:07 kremesch

I didn't realize there were more options in the settings. I never maximized the app and never saw all the options. I feel kind of silly. I got the AMD add-on to work. I just needed to override it.

So sorry for all the silliness.

Might I suggest that the menu be more dynamic? This was all I was seeing for the longest time. lol

Image

kremesch avatar Jul 02 '25 03:07 kremesch

The user interface is probably the worst part of this app - sorry for that.

I have tried to make the UI usable for all types of screens, including mobile phones. Yes, you can install Speech Note on a Linux phone and it works. When the screen gets small, instead of the tabs "General", "User interface", "Speech to text" and so on, a combo box appears in the settings (the one at the very top). You can switch to different sections by selecting different options in that combo box widget. I thought this was a good UI on a phone, but you clearly show that this still needs to be improved.

> I did some experimenting after this and found the sweet spot for the length of wav files. In my case, 10 seconds is a perfect balance for the speech engine. When I trimmed them to 10 seconds, the tempo and speed of their speech was perfect. Anything too long turns into someone talking too fast, or an alien screaming at you.

Thank you for your observations. I will try to improve the description or "help menu" in the app to guide the user on how to create a good audio sample for F5.

> TTS is a bit of passion-hobby for me since I've learned to use it decades ago to help with my dyslexia. The humanized voices have been a game-changer. Having this app and these voices for Linux is literally the best thing since sliced bread for me :)

I'm very glad that Speech Note is helpful for you. That's the main reason why I'm writing this app - to make it useful for other people :)

mkiol avatar Jul 04 '25 14:07 mkiol

> Thank you for your observations. I will try to improve the description or "help menu" in the app to guide the user on how to create a good audio sample for F5.

I've been experimenting pretty heavily these last couple of days. Here are some observations, if it helps.

For Coqui:

  • A 1-2 minute sample offers a great balance for expressive speech. The longer the sample, the fewer artifacts. Shorter samples work well, but the voice is less expressive and artifacts are sometimes present. Samples longer than 2 minutes also worked well, but they seem unnecessary.
  • I found I could create different expressive voices by creating different samples of varying lengths. It was kind of fun to discover. I also discovered the expressive speech is heavily reliant on punctuation. If I don't like the way something sounds, fixing it is simple by changing the punctuation at the end of the sentence (sometimes adding or removing punctuation mid-sentence will give better results).
  • I also noticed sentences that end in a period will sometimes add strange pronunciations at the end. Changing it to a comma fixes it.
  • Changing accents is fixable by breaking up the output and stitching it back together in an editor.
  • Coqui 2.0.2 is the most natural sounding (the best for creating files to listen to later)--this is the one with an identity crisis though (changes accents for no real reason). 2.0.3 is too expressive (over the top)--might be fun in conjunction with another engine for a project. YourTTS is perfect for proofreading (it reads in real-time with little to no load). It isn't as smart with pronunciation, but that is easily fixed in the Rules by changing it to some other pronunciation that is similar.
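The "break it up and stitch it back together" step can also be scripted. This is a rough sketch with Python's standard `wave` module, assuming each re-generated chunk was exported as an uncompressed WAV file with the same sample rate, channel count, and sample width. The function name `concat_wavs` and the file names in the usage note are hypothetical.

```python
import wave


def concat_wavs(segment_paths, dst_path):
    """Join WAV files end to end; all must share the same audio format."""
    fmt = None
    chunks = []
    for path in segment_paths:
        with wave.open(path, "rb") as seg:
            seg_fmt = (seg.getnchannels(), seg.getsampwidth(), seg.getframerate())
            if fmt is None:
                fmt = seg_fmt  # the first segment defines the format
            elif seg_fmt != fmt:
                raise ValueError(f"{path} has a different format: {seg_fmt} != {fmt}")
            chunks.append(seg.readframes(seg.getnframes()))

    if fmt is None:
        raise ValueError("no segments given")

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(fmt[0])
        dst.setsampwidth(fmt[1])
        dst.setframerate(fmt[2])
        dst.writeframes(b"".join(chunks))
```

For example, `concat_wavs(["part1.wav", "part2_redo.wav", "part3.wav"], "chapter.wav")` would replace only the chunk where the accent drifted. The join is a hard cut with no crossfade, so splitting at sentence boundaries is probably the safest choice.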

For F5:

  • A sample under 10 seconds will insert strange artifacts and words that don't exist. Samples with long pauses between speech will do the same.
  • A sample between 10 and 10.5 seconds gave near-perfect results, with the latter resulting in faster speech.
  • Anything over 11 seconds confused the engine into terrible fits of yelling, stuttering, and speaking a fictional language.
  • Samples with more dead space spoke slower. Samples with less dead space spoke faster.
  • F5 is less reliant on punctuation and sometimes breaks when it doesn't like how a sentence is structured. Changing random letters to capitals (or vice versa) altered the way the text was spoken. Capital letters offered more emphasis.
  • There are some other issues with F5 that I've been trying to work out. Sometimes it just doesn't want to read a sentence, or it will only read part of the sentence, and nothing more. Removing all punctuation and playing with capitals seems to correct it in most cases. Other times, changing the speed at which the text is spoken is the solution.

Tips for samples (works for both engines):

  • Keep the sentences short, random, and trim any dead space so it has a natural flow. 2-3 word sentences work fine.
  • Remove all punctuation from the text in your voice samples to keep them sounding more natural for whichever engine you're using. F5 is particularly sensitive to this and removing the punctuation is a requirement.
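Stripping the punctuation from the sample's transcript is easy to automate. Here is a small sketch using Python's `re` module; the helper name `strip_punctuation` is hypothetical. Keeping apostrophes inside words (so "It's" does not become "Its") is my own assumption, not something tested above.

```python
import re


def strip_punctuation(text):
    """Remove punctuation, keeping letters, digits, spaces, and in-word apostrophes."""
    # Drop everything that is not a word character, whitespace, or apostrophe.
    cleaned = re.sub(r"[^\w\s']", "", text)
    # Assumption: keep apostrophes only between two ASCII letters ("It's"),
    # and drop the rest (for example, single quotes around a phrase).
    cleaned = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", "", cleaned)
    # Collapse whitespace runs left behind by the removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_punctuation("Hello, world! It's a test."))
```

The result can be pasted into the text field that accompanies the voice sample.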

In short:

  • Coqui likes longer voice samples, roughly around 2 minutes.
  • F5 likes short voice samples, roughly around 10 seconds.

kremesch avatar Jul 04 '25 17:07 kremesch