Multiple Phonemizer Support
The piper-phonemize setup is a bit confusing at the moment, as it's both included with some significant code and imported as a library at runtime. The two phonemizers, text and espeak, are tightly integrated into piper and piper-phonemize. Furthermore, they link against the espeak-ng library, which is under the GPL, meaning piper-phonemize is also under the GPL (when distributed), and thus piper is under the GPL as well.
My proposal is this:
- Create a standard interface for a phonemizer between piper/piper-phonemize. This could be 3 functions: initialize, phonemize, terminate. The initialize could also pass in configuration data if required.
- Have the phonemizer be selectable at startup via a flag instead of from the voice config. I'm not sure if there's a technical reason the phonemizer is configured in the voice .json file, but it seems unnecessary as long as the phonemes match.
- Separate the phonemizers within piper-phonemize to be different libraries that are loaded only if the configuration requires it. For example on Linux to phonemize text into a vector of phonemes using espeak:
#include <dlfcn.h>  // dlopen/dlsym on Linux; Windows would use LoadLibrary/GetProcAddress
auto libraryHandle = dlopen("phonemizer_espeak.so", RTLD_LAZY);
auto phonemizeFn = reinterpret_cast<void (*)(const std::string &, std::vector<std::vector<Phoneme>> &)>(dlsym(libraryHandle, "phonemize"));
phonemizeFn(text, phonemes);
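The proposed three-function interface could look something like the sketch below. All names here (phonemizer_initialize, etc.) are illustrative, not an existing piper-phonemize API, and the body is a toy stand-in just to show the shape a real plugin would fill in:

```cpp
// Hypothetical three-function plugin interface for a piper phonemizer.
// All names here are illustrative, not an existing piper-phonemize API.
#include <string>
#include <vector>

using Phoneme = char32_t;  // piper-phonemize represents phonemes as UTF-32 codepoints

// C linkage so dlsym()/GetProcAddress() can resolve the symbols by name.
extern "C" {
bool phonemizer_initialize(const char *configJson);
void phonemizer_phonemize(const std::string &text,
                          std::vector<std::vector<Phoneme>> &phonemes);
void phonemizer_terminate();
}

// Toy implementation that treats each byte as its own "phoneme",
// just to show the shape a real plugin would fill in.
bool phonemizer_initialize(const char * /*configJson*/) { return true; }
void phonemizer_terminate() {}
void phonemizer_phonemize(const std::string &text,
                          std::vector<std::vector<Phoneme>> &phonemes) {
  std::vector<Phoneme> sentence;
  for (unsigned char c : text)
    sentence.push_back(static_cast<Phoneme>(c));
  phonemes.push_back(std::move(sentence));
}
```

Each phonemizer library would export these three symbols, and piper would only ever call them through pointers obtained from dlsym/GetProcAddress.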
This would allow an easy way to integrate a new phonemizer without updating both programs and even allows a new library to be added without updating piper-phonemize. Plus, the dependency on espeak-ng would be optional which means it could be distributed under the much more permissive MIT license.
I can implement some of the changes to do this, but as it would be a fairly substantial change, I thought it would be best to discuss it first.
I was going to create a similar issue. Thanks to the author for all the hard work. Really cool project.
Judging by the comment here: https://github.com/OpenVoiceOS/ovos-tts-plugin-piper/issues/2#issuecomment-1579658136 I think the problem is that different phonemizers generate different IPA characters when phonemizing (because the author said piper models would likely need to be retrained for a new phonemizer).
So if another phonemizer generates a sequence of IPA characters the current models aren't trained on, speech synthesis isn't going to work. There is a function in this repo, phonemes_to_ids, which will pass back "missing phonemes" if you feed piper phonemized text it doesn't understand (which an alternative phonemizer may or may not generate). I don't think the current phonemizer supports every IPA character, so just swapping in a new phonemizer likely isn't easy.
Ideally, if there were another phonemizer out there that restricts its output to only the IPA characters espeak-ng currently uses, it would be backward compatible with the already trained models. As long as the alternative phonemizer generates IPA characters that map to ids the piper models understand, it should work. I don't think this would be a GPL issue as long as the new phonemizer uses its own algorithm to phonemize (it wouldn't be a derivative of espeak-ng). I don't think you can GPL the alphabet.
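The "missing phonemes" mechanism described above can be sketched roughly like this (illustrative code, not the actual phonemes_to_ids from this repo):

```cpp
// Sketch of the "missing phonemes" idea: map phonemes to model ids and
// collect anything the id table doesn't know about. Illustrative only;
// not the actual phonemes_to_ids from this repo.
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Phoneme = char32_t;
using PhonemeId = std::int64_t;

std::vector<PhonemeId>
phonemes_to_ids(const std::vector<Phoneme> &phonemes,
                const std::map<Phoneme, PhonemeId> &idMap,
                std::map<Phoneme, std::size_t> &missing) {
  std::vector<PhonemeId> ids;
  for (Phoneme p : phonemes) {
    auto it = idMap.find(p);
    if (it != idMap.end())
      ids.push_back(it->second);
    else
      missing[p]++;  // a phoneme the model was never trained on
  }
  return ids;
}
```

Any alternative phonemizer whose output lands entirely inside idMap would pass through with an empty missing table, which is the backward-compatibility condition discussed above.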
I'm considering two alternatives to espeak-ng to avoid licensing issues:
- Using text phonemes with byte-pair encoding (BPE), possibly with [pre-trained sentencepiece models](https://github.com/bheinzerling/bpemb)
- Reviving the gruut project and porting it to C++
In both cases, I expect that all of the voices will need to be retrained.
For option 1, I don't think training from a base English voice will work as well anymore because of differing character sets. Option 2 will have limited language support, and the licensing on the phoneme dictionaries is completely unknown (many have been floating around the internet for years without attribution).
Here's another question, likely with no answer: if I were to implement my own (clean room) copy of eSpeak's phonemization rule engine, would the dictionary data files be usable without a GPL license? I see 3 other licenses in the espeak-ng repo (BSD, Apache 2, UCD), so I have no idea what applies to the source code vs. data files.
The espeak library is pretty good at its job and doesn't necessarily need to be replaced; it just needs to be less tightly coupled to piper so someone could swap it out with a different library if they wanted.
Then the phonemizer could be espeak or gruut or sequitur or whatever. Making a new phonemizer is a big endeavor and there's no need to re-invent the wheel.
> Here's another question, likely with no answer: if I were to implement my own (clean room) copy of eSpeak's phonemization rule engine, would the dictionary data files be usable without a GPL license? I see 3 other licenses in the espeak-ng repo (BSD, Apache 2, UCD), so I have no idea what applies to the source code vs. data files.
The rules files at least have the GPLv3 license at the top; I imagine the dictionary files do as well, but it's not too difficult to find dictionary files elsewhere.
The phonemizer appears to be tightly coupled to piper because the voice models piper uses understand the phonemes espeak produces. There isn't a universal way to phonemize. As the author said he expects that all the existing voice models would need to be retrained for a different phonemizer. If you have to train a new voice model per phonemizer that isn't going to scale.
I tried swapping in a different phonemizer, but it phonemizes in a different way than espeak; it uses some phonemes that espeak doesn't use and vice-versa. I think I can remap some of the phonemes in the replacement phonemizer to equivalent ones the model understands to mitigate this, but it looks like it is going to be a bit hairy.
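The remapping idea could be as simple as a codepoint table applied to the replacement phonemizer's output. A sketch, where the table entries would have to be curated by hand (nothing below comes from piper itself):

```cpp
// Sketch of remapping a replacement phonemizer's output onto the phoneme
// set an existing voice model understands. A real table would need
// careful, per-language curation.
#include <map>
#include <vector>

using Phoneme = char32_t;

std::vector<Phoneme> remap(const std::vector<Phoneme> &in,
                           const std::map<Phoneme, Phoneme> &table) {
  std::vector<Phoneme> out;
  out.reserve(in.size());
  for (Phoneme p : in) {
    auto it = table.find(p);
    out.push_back(it != table.end() ? it->second : p);  // pass through unmapped phonemes
  }
  return out;
}
```

The hairy part is that not every substitution is one-to-one; some phonemes may need to map to sequences, or have no close equivalent at all.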
A less sophisticated yet still complicated approach is to build a phonemizer using all the same phonemes as espeak (no more and no less). The IPA characters themselves can't be GPL'd. If you could GPL the alphabet all written text would be considered a derivative.
I don't think you can GPL a map table: ["Apple" : 🍎], but the dictionary data files seem to be a bit more than that. I would think using them directly would probably require licensing the new phonemizer under the GPL, so it would probably be better to avoid them.
A phonemizer that outputs the same IPA characters would be backward compatible, though perhaps constraining oneself to use only the phonemes that espeak does would feel too restricting. That would perhaps be one of the tradeoffs for trying to be a "swap in" replacement for espeak.
@SeymourNickelson I actually did train an "eSpeak compatible" phonemizer in gruut; there are separate database files for that. It works OK, but espeak-ng is a bit more sophisticated than you might expect. It handles some part-of-speech dependent pronunciation rules (for English at least) like "I read a book yesterday" vs. "I read books often". Additionally, it's able to break apart words somewhat intelligently: like pronouncing a username hansenm as "Hansen M".
@kbickar I don't want to reinvent the wheel, but the licensing question comes up quite frequently. Similarly, using the Lessac voice as a base adds more questions when people want to use Piper commercially. While I sympathize with the GPL philosophy, I prefer to keep my stuff MIT/public domain. And if I'm going to suggest people contribute pronunciation fixes, etc., it makes more sense to do it for a project with fewer restrictions.
At least the ability to train a new base model from scratch is relatively straightforward, so a model without the Lessac dataset can be created and used with piper out of the box.
Some sort of plugin interface would be great.
> @SeymourNickelson I actually did train an "eSpeak compatible" phonemizer in gruut; there are separate database files for that. It works OK, but espeak-ng is a bit more sophisticated than you might expect. It handles some part-of-speech dependent pronunciation rules (for English at least) like "I read a book yesterday" vs. "I read books often". Additionally, it's able to break apart words somewhat intelligently: like pronouncing a username hansenm as "Hansen M".
Cool! I'll have to check out Gruut. It seems eSpeak tries to go the extra mile phonemizing (I haven't looked at the internals), but it definitely doesn't handle everything perfectly either. In my testing it didn't handle "I read a book yesterday" properly. I wonder if there is a good open source "part of speech tagger" out there that input text could be fed through before phonemizing, to disambiguate words pronounced differently in different contexts.
Unfortunately for me I'm not working in Python so I'd have to port Gruut to my native programming language (which isn't C++ either, although a C++ version would be more accessible for my target platform). Might be worth it. I just did this (ported from Python) with another phonemizer but unfortunately that one phonemizes every word independently and not in the context of surrounding words; it doesn't try to handle some of these complex pronunciation rules you mention.
Gruut's supported language list would be enough for me, so if you did port it to C++ for Piper at some point, maybe those needing a phonemizer in another language could fall back to espeak.
from a dev POV, I would like to see gruut as an option, and honestly I would love to see a c++ incarnation that is continuously updated. just like this project is now tackling license issues instead of focusing on the code, the same will happen to future projects that use espeak (i expect that to not be uncommon, due to the lack of alternatives). a permissively licensed phonemizer to replace espeak would benefit the whole voice ecosystem and help future devs and projects avoid this same issue
let's assume gruut voices sound worse than espeak voices. from a user POV it would be nice if piper supported both gruut and espeak voices: just making espeak optional makes piper GPL-free. using a voice that needs espeak will drag in the GPL license, but that is then voice-specific and not library-specific. users can pick whatever voice sounds best to them, espeak or gruut based; a user won't care about GPL
i understand this means at least double the work, without even counting the time to port gruut to c++. totally understandable if it's not feasible, but i wanted to leave my 2 cents
> 1. Using text phonemes with byte-pair encoding (BPE), possibly with [pre-trained sentencepiece models](https://github.com/bheinzerling/bpemb)
I recently came across this paper https://assets.amazon.science/25/ae/5d36cc3843d1b906647b6b528c1b/phonetically-induced-subwords-for-end-to-end-speech-recognition.pdf and I previously also played around with this repo https://github.com/hainan-xv/PASM
this is a bit over my area of expertise, but you should be able to understand the nuances better and judge if it's applicable or useful
> Closer to our approach is the Pronunciation Assisted Subword Modelling (PASM) that was shown to outperform BPE and single character baselines [27]. Subword generation in PASM is based on consistent alignments between single phonemes and single characters. A downside of this approach is that it tends to choose short subwords and avoids modelling full words with single tokens. As a consequence, subword variability is limited and, along with the method's exclusion criteria, the resulting vocabularies are relatively small (around 100 and 200 subwords for WSJ and Librispeech respectively). We compare our results to PASM in our 200 subword experiments.
apologies if this is irrelevant, but since you mentioned BPE i thought it could be helpful
Not directly related, but maybe relevant to the adoption of general-purpose phonemizers.
I did integrate our Icelandic phonemizer into piper directly as an alternative to eSpeak, because the Icelandic version of eSpeak uses an old IPA symbol set and additionally the normalization is not very good for Icelandic (e.g. homographs, dates and numbers, I am looking at you ...).
The integration was not really difficult and took me half a day or so, because our pipeline is also Python-based: see https://github.com/grammatek/ice-g2p for the phonemizer. In Piper, however, I changed the symbols and only used those of our alphabet. As I am training from scratch and don't want to fine-tune any existing model, that's probably ok.
We are using X-SAMPA by default in our grammars and symbols, but remapping this to IPA is just a lookup. See https://github.com/grammatek/ice-g2p/blob/master/src/ice_g2p/data/sampa_ipa_single_flite.csv
We also have an Android app that uses the C++ library Thrax for G2P: https://github.com/grammatek/g2p-thrax. Thrax is totally rule-based and does not perform as well as the other G2P module, but it is good enough for most purposes.
The ice-g2p module uses a BI-LSTM for the conversion, which is pretty good. But Icelandic pronunciation is also very regular, and only homographs need special treatment.
What we do additionally is using a very big G2P dictionary to speed up our inference time. This just needs to be processed once in a while offline and then you can use it efficiently at runtime. If you are processing a large enough corpus of a specific language, you will get a very good coverage for most words. And homographs can be chosen dynamically depending on some rules/model instead.
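The dictionary-first approach described above can be sketched as a lookup with a model fallback. Names and types below are illustrative, not taken from ice-g2p:

```cpp
// Sketch of dictionary-first G2P: look each word up in a large precomputed
// pronunciation dictionary and fall back to a model only for OOV words.
// Names and types are illustrative.
#include <functional>
#include <string>
#include <unordered_map>

using Pronunciation = std::string;  // e.g. an IPA string

Pronunciation g2p(const std::string &word,
                  const std::unordered_map<std::string, Pronunciation> &dict,
                  const std::function<Pronunciation(const std::string &)> &oovModel) {
  auto it = dict.find(word);
  if (it != dict.end())
    return it->second;    // fast path: hit in the offline-generated dictionary
  return oovModel(word);  // slow path: e.g. a small LSTM exported to ONNX
}
```

With good corpus coverage the fast path handles the vast majority of words, so the model only runs rarely and runtime inference cost stays low.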
What we found absolutely necessary for text normalization is a PoS tagger. We also trained a BI-LSTM-based PoS model within our Icelandic language technology program, but there are some other alternatives available; for Python, e.g. StanfordNLP Stanza, under the Apache 2.0 license.
@lumpidu Thanks a lot for sharing. Very informative.
I integrated another phonemizer in Python to use in the Piper training script (basically followed the training guide). The only dependency I didn't install is piper-phonemize; I just stubbed in my own Python module that returned the expected data for preprocessing (shimmed in all the values from the replacement phonemizer).
Because this phonemizer uses different symbols than espeak, I also need to train from scratch. Do you mind sharing what hardware you are training on? I can't get piper to train on the GPU on my Apple hardware (and I'm not sure, even if I could get training on the GPU, whether it would be fast enough). Google Colab keeps throwing me off my training session before I can finish, even though I still have compute credits. Colab feels like a weird system: throw a paying customer out in the middle of a training session at any time, no questions asked; delete all the data and keep the money!
@SeymourNickelson: sure, we use our own compute hardware, a Ryzen Threadripper Pro workstation with 32 cores, 512GB RAM, lots of SSDs and 2x A6000 Nvidia cards. There is also a 3090 card inside that I mostly use for inferencing. I am currently training an xs model (with smaller parameter size but 22.05 kHz files) on my 2x A6000 cards. This model is meant for inferencing on the Android phone. Training runs smoothly, now at a bit more than 1500 epochs after almost 2 days overall, i.e. ~110 seconds/epoch with a >17,000-file dataset. Because these cards have 48GB RAM, I use a batch_size of 64 and a symbol_size of 600, and still the memory is not even half filled. I have no experience with Google Colab. We decided against using cloud GPUs one year ago, and owning a dedicated GPU workstation has a lot of pros that I don't want to miss. OTOH, I would use cloud GPUs from some of the usual suspects for training runs that take longer than, say, 2 weeks.
Flite and Festvox also have their own phoneme engines. Flite, for example, is under an unlicensed/BSD-style custom license. I believe it uses a version of the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) and letter-to-sound rules. I like that it doesn't need runtime files, since it converts cmudict into C code structures and compiles them in directly. From what I've seen, the phonemes match espeak well (if you look at the Flite repo, there is a PR at the top there that converts Flite phonemes to IPA phonemes; this is the version).
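Flite's compiled-in dictionary idea, in miniature, could look like this (the entries below are made up; Flite generates its real table from cmudict at build time):

```cpp
// Miniature version of the Flite idea: compile the pronunciation dictionary
// into static C structures instead of shipping runtime data files.
// The entries below are made up; Flite generates its table from cmudict.
#include <string>

struct DictEntry {
  const char *word;
  const char *phones;  // ARPAbet-style phones, as in cmudict
};

static const DictEntry kDict[] = {
    {"cat", "K AE T"},
    {"dog", "D AO G"},
};

const char *lookup(const std::string &word) {
  for (const auto &entry : kDict)
    if (word == entry.word)
      return entry.phones;
  return nullptr;  // OOV: a real engine falls through to letter-to-sound rules
}
```

Baking the dictionary into the binary trades some executable size for zero runtime file dependencies, which is exactly the property praised above.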
It's truly kind of wild that there is no simple and straightforward C/C++ phonemizer. I ported Piper over to Zig for fun (https://github.com/sweetbbak/sayu), and espeak was kind of a pain to deal with. Same with libonnxruntime, if I'm being honest.
I'd love to work on porting Flite's phonemizer over to C/C++ or Zig lol
We have gladly turned away from Flite/Festvox and never looked back! Building all the dependencies in the right order alone is a nightmare. There are so many compatibility issues and C/C++ compiler warnings, it's not even funny. The authors have also obviously abandoned the project (the last commit was 3 years ago).
I think the most straightforward method is a dictionary lookup created by a good model (you can also use eSpeak to generate that), in combination with a small LSTM model for OOV words in your language, using ONNX Runtime for inference. Or use the Thrax approach as we did for Icelandic, but then you need to create custom rules for each language.
to be clear, I'm talking about a community effort writing an entirely new phonemizer based on Flite or Flite's methods, not using it directly. Trying to read Flite/Festvox code feels like reading gibberish though; it's quite dense. What I personally don't like about using a model is the necessity to pull in dependencies to run that model, but overall it's a simple solution, so there's a tradeoff there.
I just think it would benefit a lot of people if there was an MIT or Unlicensed tokenizer/phonemizer that was lean and fast. But at the same time I'm just a text-to-speech enthusiast and I don't really have any skin in the game.
Ideally it would be an open interface, so you could build Piper and build a phonemizer to go with it. Unfortunately it's pretty tightly integrated, with headers and data structures that make it hard to build Piper without espeak.
I wouldn't remove eSpeak from Piper, rather add another option. As I said: it's relatively easy to create, for many languages, a large enough pronunciation dictionary, given you have enough vocabulary to go with it (or a large text corpus to derive it from). There will be OOV words, but most of the time (>99%) you can derive the pronunciation by generation via e.g. eSpeak or CharsiuG2P. You can even use GPL code to create the dictionaries without the need to link it in at runtime. This is the fastest, most lightweight way to do G2P.
There are already a large number of dictionaries available: CharsiuG2P
eSpeak could still be shipped with Piper; it would just be better to allow Piper to be built without depending on eSpeak, since that dependency puts it under the GPL 3.0 license.
I believe you can build a model using the "text" phonemizer, which just passes the phonemes through, allowing phonemization to be done as a preprocessing step. That allows dictionaries or other phonemizers to be used, just less conveniently.
Yeah, one could just remove the dependency and document the steps necessary to integrate it. This way, all the calling code could just stay, and the dependencies would be optional and pluggable.
Is it possible to do CharsiuG2P inference without using Python? It looks promising, but that seems excessive; I don't see any examples of such. You could naively do word lookups for most words, I guess.
It appears to need these at the very least:
from transformers import T5ForConditionalGeneration, AutoTokenizer
from segments import Tokenizer
I'm not sure how you would even go about that? At that point you may as well just shell out to espeak-ng/flite/festival etc...