dsnote Making sense of all the PiperTTS models

Hello,

The work you are doing and the work the people at PiperTTS are doing is amazing.

One thing that confuses me, though, is the large number of Piper-related models within the application, for example LibriTTS and LibriTTS-R.

They seem to come from the same HuggingFace folder, but there are more than 50 models for each of these options.

How would one know which model to install, and how does each one differ from the others?

Feb 20 '24 18:02 Kentoseth

Thanks for the question.

In general, it is hard. The problem here is that there are too many English voices! I didn't want to decide which one was better and which one was worse, so I included them all. Maybe it wasn't a good strategy and I should have made a selection, but I didn't. Everything that is available in Piper is also available in Speech Note.

Piper LibriTTS and Piper LibriTTS-R are multi-speaker models, so one checkpoint file can generate many totally different voices. In Speech Note every "voice" is presented as a separate model, but under the hood all LibriTTS voices (and likewise all LibriTTS-R voices) use the same checkpoint file. The file is downloaded only once, so no worries.

Names such as "P7910" or "3615" come from the original speaker names in the training data. My initial idea was to add at least a male/female indication to the name, but I gave up because there are just too many of them. That's why you see these long and meaningless names :/

In this particular example, LibriTTS-R is a restored version of the LibriTTS corpus. The voices are similar, but LibriTTS-R should sound a bit better.
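
To illustrate the "one checkpoint, many voices" idea, here is a minimal Python sketch. It assumes the model's JSON config carries a `speaker_id_map` field that maps speaker names to numeric ids (this is how Piper voice configs are laid out as far as I recall, but treat it as an assumption). The ids, the file name `libritts.onnx`, and the helper `list_voices` are made up for the example.

```python
import json

# Illustrative config fragment. The keys "num_speakers" and "speaker_id_map"
# are assumed to follow the Piper voice config layout; the ids are made up.
CONFIG_JSON = """
{
  "num_speakers": 2,
  "speaker_id_map": {"P7910": 0, "3615": 1}
}
"""


def list_voices(config_text, checkpoint):
    """Return one (voice_name, checkpoint, speaker_id) entry per speaker."""
    config = json.loads(config_text)
    speaker_map = config.get("speaker_id_map", {})
    # Sort by speaker id so the listing order is stable.
    ordered = sorted(speaker_map.items(), key=lambda item: item[1])
    return [(name, checkpoint, speaker_id) for name, speaker_id in ordered]


# Every "voice" points at the same (hypothetical) checkpoint file.
voices = list_voices(CONFIG_JSON, "libritts.onnx")
for name, checkpoint, speaker_id in voices:
    print(f"{name}: checkpoint={checkpoint}, speaker_id={speaker_id}")
```

Each entry differs only in its speaker id, which is why the checkpoint needs to be downloaded just once.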

Feb 20 '24 20:02 mkiol

Piper LibriTTS and Piper LibriTTS-R are multi-speaker models, so one checkpoint file can generate many totally different voices. In Speech Note every "voice" is presented as a separate model, but under the hood all LibriTTS voices (and likewise all LibriTTS-R voices) use the same checkpoint file. The file is downloaded only once, so no worries.

My suggestion to fix this issue of many voices using one model is to enable downloading of the model as only one option. And then within the main interface, a person can choose the many different voices that are available. Similar to how the CoQui X-TTS model works, where a person can choose different voice options from the main interface.

If you choose to do that, then I can go through some of the different voice models and add some metadata to them to indicate whether it is male or female.

I can submit this as a text file here in the GitHub issues, or you can indicate your preferred format and I will provide that, so that adding it to the application is as easy as adding the file and linking to its data.
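
As a rough idea of what such a metadata file could look like, here is a small Python sketch. The JSON layout, the speaker names `speaker_0001` and `speaker_0002`, the gender labels, and the helper `display_name` are all hypothetical; nothing here is an agreed format or real annotation data.

```python
import json

# Hypothetical metadata file: speaker name -> attributes.
# Names and labels are placeholders, not real annotations.
METADATA_JSON = """
{
  "speaker_0001": {"gender": "female"},
  "speaker_0002": {"gender": "male"}
}
"""

metadata = json.loads(METADATA_JSON)


def display_name(speaker, metadata):
    """Append the gender label to the speaker name when one is known."""
    gender = metadata.get(speaker, {}).get("gender")
    return f"{speaker} ({gender})" if gender else speaker


print(display_name("speaker_0001", metadata))
print(display_name("speaker_9999", metadata))  # no metadata: name is unchanged
```

Speakers without an entry simply keep their plain name, so the file could be filled in gradually.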

Feb 22 '24 12:02 Kentoseth

My suggestion to fix this issue of many voices using one model is to enable downloading of the model as only one option. And then within the main interface, a person can choose the many different voices that are available. Similar to how the CoQui X-TTS model work

That's a very good idea! 👍🏿

If you choose to do that, then I can go through some of the different voice models and add some metadata to them to indicate whether it is male or female.

That would be super great :) I will let you know when it is ready. Most likely I won't be able to implement this in an upcoming release, but later.

Feb 22 '24 12:02 mkiol

This is an issue, and I see it mostly as a user interface issue. I think voices selected from the long list need to be moved to a short list where they can be found more easily, auditioned, and deleted from the short list if not wanted. I'm of the opinion that a limited number of quality voices is better than a massive list of mediocre voices.

What makes a good voice? It seems to me some voices are more distinctive and do a better job of making printed text both listenable and comprehensible. It's the same thing when a real person is auditioned for an actual part in a play or drama: how good is their delivery? The same could be said about piano competitions. Everyone is playing the same pieces, usually quite well, but a few are considered to be superior.

The issue right now is the ease of management of the voices by the end user. This is an awesome project and so much better than anything I tried on Linux in the past. I feel end users just need a space where they could talk about the project without submitting an "issue." End users collectively could probably figure out a short list of useful voices. It's more complicated with all the languages supported, most of which I will never use.

I basically have two uses for a program like this: language learning and the creation of audiobooks. I've made some examples, but for some unexplained reason .mp3 files aren't accepted on the GitHub platform, which is ridiculous for a project of this type. Are they concerned people will post copyrighted songs?

Some thoughts on voices. Sometimes you just want the purest form of speech possible, without much personality, in a male and a female version. Other times it would be fun to have distinctive personalities, such as a person reading English with a French, German, Russian, Chinese, or Japanese accent. Sometimes it would be fun to have recognizable voices like Obama, Hillary or Trump. Generally I don't care for American English when creating an audiobook; on the other hand, a strongly accented Southern voice or other regional voice would be useful. Right now there seems to be a massive list of nondescript voices that generally lack personality. Think about jazz sax. What makes a great sax player? It's partially the base tone, but a lot more about how the player bends individual notes.

Jun 22 '24 16:06 gbodley

The new version 4.6.0 brings some improvements. It's not exactly as @Kentoseth described, but it's better. Multi-voice models are grouped, so navigating the model browser is a bit easier.

The new version should be available in Flathub tomorrow.
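
For readers curious what grouping multi-voice models can look like, here is a minimal Python sketch. It is only an illustration of the idea, not how Speech Note implements it; the voice names and the helper `group_by_model` are hypothetical.

```python
from collections import defaultdict

# Flat list of (model, voice) entries, as a browser without grouping
# would show them. Voice names are placeholders.
entries = [
    ("LibriTTS", "voice_a"),
    ("LibriTTS-R", "voice_a"),
    ("LibriTTS", "voice_b"),
    ("LibriTTS-R", "voice_b"),
]


def group_by_model(entries):
    """Collect voices under the multi-speaker model they belong to."""
    groups = defaultdict(list)
    for model, voice in entries:
        groups[model].append(voice)
    return dict(groups)


groups = group_by_model(entries)
for model, voices in groups.items():
    print(f"{model}: {len(voices)} voices")
```

With grouping, the browser can show one row per model and expand it to reveal the individual voices.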

Aug 03 '24 13:08 mkiol