tortoise-tts
A possible approach to pronunciation customization
Hi, I'm going to re-raise the topic in #12, which is currently closed. I apologize, and I appreciate that this is in some sense bad form.
I also would like the ability to, occasionally, fine-control pronunciation, and I am of the belief that it's fundamentally not a machine-solvable problem, thanks to the literal nightmare that is last names. I know six people who have the same last name, codepoint for codepoint, but none of them say it the same way, and there's nothing your software could ever do to cope with that, because the necessary context simply isn't available.
The problem is that if you want to do high-quality rendering, getting names right is a sign of respect, so this genuinely matters, and I believe it needs to be droppable to user control in some way.
And so I was going to go bug the ocotillo author. Hm. Guess that works out nicely.
I don't entirely understand where the English <-> Audio mapping comes from, but on a quick glance, it looks like it might be in jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli.
And so I was wondering.
- How hard would it be to have two of these?
- If the underlying symbolic language were in some way deterministic with regard to end pronunciation - that is, if it's somehow a least-worst case - how hard would it be to adapt the jbetker model to a second syllabary?
The reason being, y'know, the International Phonetic Alphabet is in Unicode, and does a pretty reasonable job with most real world languages. And that would reduce the job to Googling someone's name once, putting it in a lookup table in IPA, and promptly forgetting about it for eternity.
Which, to me, sounds pretty good.
Or, if you prefer, ask Siobhan and Pádraig Moloughney from Worcester, Massachusetts
("shavon and petrick molockney from wooster mass").
Let's talk to [ipa:ʃəˈvɔːn] and [ipa:ˈpˠɑːɾˠɪɟː mʌːlɒkːniː] about it
is nicely unambiguous, and fits with the symbology in the other request
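To make that concrete, here's a minimal sketch of the lookup-table front end I have in mind. The `[ipa:...]` markup and the lexicon entries are just this proposal - nothing TorToiSe supports today:

```python
import re

# One-time, hand-curated name lexicon; the IPA strings here are illustrative.
NAME_LEXICON = {
    "Siobhan": "ʃəˈvɔːn",
    "Pádraig": "ˈpˠɑːɾˠɪɟ",
    "Moloughney": "mʌˈlɒkniː",
}

def mark_up_names(text: str) -> str:
    """Replace known names with the proposed [ipa:...] tags before synthesis."""
    for name, ipa in NAME_LEXICON.items():
        text = re.sub(rf"\b{re.escape(name)}\b", f"[ipa:{ipa}]", text)
    return text

print(mark_up_names("Let's talk to Siobhan and Pádraig Moloughney about it."))
# -> Let's talk to [ipa:ʃəˈvɔːn] and [ipa:ˈpˠɑːɾˠɪɟ] [ipa:mʌˈlɒkniː] about it.
```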
This sounds like what 15.ai does.
Phonological conventions and ARPAbet You may notice that certain words are pronounced slightly differently than what you might expect. For example, suppose that you wish to have a character pronounce the word "Internet" as /ˈintərˌnet/ with the phoneme ⟨t⟩ clearly enunciated in between the first and second syllables. The algorithm, however, may tend to favor the pronunciation /ˈinərˌnet/, opting to elide the ⟨t⟩ — the way that most Americans would pronounce the word "Internet" in everyday speech (this phonological phenomenon is known as intervocalic flapping). If you wish, you can override the AI's preference by inserting ARPAbet strings wrapped in curly braces {} — in the case of "Internet," you would use the input {IH1 N T ER0 N EH2 T} instead to explicitly instruct the model to pronounce the ⟨t⟩. In most cases, however, the AI does a very good job guessing the most appropriate pronunciation.
Lexicon The lexicon that the model uses is an amalgamation of various dictionaries, words that have been scraped from both physical and digital references, and AI-generated phonetic transcriptions. While the lexicon contains many obscure and topical words that are not typically found in phonetic references (for example, the word VTuber — /ˈvēˌto͞obər/ or {V IY1 T UW2 B ER0}), it is by no means complete. If you find a word that does not exist in the lexicon, or if you believe that a word may be incorrectly transcribed in the lexicon, feel free to send me an email or tag me in a tweet.
Great Idea.
@neonbjb You reference in the Design Document
Text is converted to tokens using a custom 256-token BPE lexicon which was trained on the text side of the speech dataset used. Speech is converted to 8192 tokens by the VQVAE mentioned above.
I would assume that this tokenizer works on a similar principle. That is to say, we would just need an interface to provide the already-mapped tokens, { custom 256 tokens... }, as input to the model.
Though looking over data\tokenizer.json, it looks like the tokenizer is more basic?
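For anyone who wants to check, the file can be loaded directly. This assumes data/tokenizer.json is a serialized Hugging Face `tokenizers` file, which is what it appears to be:

```python
from tokenizers import Tokenizer

# Inspect the shipped lexicon: vocab size and how plain text gets split.
tok = Tokenizer.from_file("data/tokenizer.json")
print(tok.get_vocab_size())            # expected to be on the order of 256
print(tok.encode("internet").tokens)   # the BPE pieces the model actually sees
```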
Ok, yeah, their ... their typeable syntax is honestly much better. 😅
Maybe supporting both would be nice, though, since it's just a character conversion, and since most copy-pasteable presentations of formal pronunciations in sources like Wikipedia are in IPA.
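As a sketch of what "just a character conversion" could look like - a partial, illustrative ARPAbet-to-IPA table, not a complete or authoritative mapping:

```python
# Partial ARPAbet -> IPA table (illustrative only; a real one covers ~50 symbols).
ARPABET_TO_IPA = {
    "IH": "ɪ", "N": "n", "T": "t", "ER": "ɚ",
    "EH": "ɛ", "V": "v", "IY": "i", "UW": "u", "B": "b",
}

def arpabet_to_ipa(arpa: str) -> str:
    """Convert a space-separated ARPAbet string like 'IH1 N T ER0 N EH2 T' to IPA."""
    out = []
    for sym in arpa.split():
        sym = sym.rstrip("012")            # drop stress digits (IH1 -> IH)
        out.append(ARPABET_TO_IPA.get(sym, sym))
    return "".join(out)

print(arpabet_to_ipa("IH1 N T ER0 N EH2 T"))  # -> ɪntɚnɛt
```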
Might also help address #9
The default voice probably sounds like a midpoint between the bag of English-speaking accents, meaning anyone speaking X will perceive it as being weighted towards Y
Emphasizing the rhotic R may "Americanize," et cetera
I have been thinking about this over the last two days. In retrospect, I think it would have been absolutely possible to have trained Tortoise to speak both conventional alphabet and phonetic alphabet. There are plenty of datasets out there that use the phonetic alphabet that I could have inserted into training (or I could have trained a wav2vec2 model to transcribe into phonetic AND conventional and then picked one version at random while training Tortoise). So I guess the answer to the question/suggestion here is "yes - I am pretty sure that this is possible".
As it stands, though, if I wanted to train Tortoise to be able to speak the phonetic alphabet, I'd need to change its symbolic lexicon. I'm a bit nervous that this will involve re-training the autoregressive transformer.
I'm willing to try making this fix, because I agree that this would be a major feature addition, but I cannot currently commit to it. My priority right now is implementing a feature to support the suggestion from #16 because I think the finding there is super cool and it won't tie up my GPUs, which are currently working on something else. :)
Let's keep this open, and I will try to get around to it.
I just want to check - do you all think that my description of the problem and potential solution is correct? Can anyone think of a way to fix this without having to re-train (or at least fine-tune) the autoregressive transformer? (As a quick summary: it converts text into very low-resolution audio signals, and is responsible for all the high-level rendering like tone and pronunciation.)
I like your description of the problem. I do not understand the domain well enough to comment on a solution.
it won't tie up my GPUs
How much iron does someone need to bring to the party to help?
I have a couple comments.
Since you are using a custom dataset (I know you mentioned you used ocotillo to label it), note the following about Wav2Vec2Phoneme:
- Wav2Vec2Phoneme uses the exact same architecture as Wav2Vec2
- Wav2Vec2Phoneme is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- Wav2Vec2Phoneme can be fine-tuned on multiple languages at once and decode unseen languages in a single forward pass to a sequence of phonemes
- By default the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one should make use of a dictionary and language model.
The Wav2Vec2Phoneme outputs would then become the basis for your input tokens, which would require retraining. But that assumes the transcriptions are accurate.
Secondly, is it worth having both a conventional alphabet and a phonetic alphabet in the end? Shouldn't all text be tokenized into the best representation it can be? I get why you would want to continue to utilize what the model knows now, but that seems like a mistake, as the phonetic alphabet is vastly more descriptive. Again, this is not my area.
You could even train a model to tokenize unseen words, and then allow an override as 15.ai does.
Before pursuing a larger model, a solution for intonation should be found and integrated. I'd hate to see a large version of this model that isn't controllable. https://github.com/neonbjb/tortoise-tts/issues/10#issuecomment-1113539858 is clever, but something without the need for references would be great too. But as a control scheme, yes, 100%, very valuable.
Even though you have no formal paper published that I know of, you could reach out to Yannic Kilcher:
https://www.youtube.com/c/YannicKilcher https://discord.gg/4H8xxDF https://twitter.com/ykilcher
He has a YouTube series called "(with the authors)", where he brings on the authors of papers he has reviewed on his channel. He recently had the folks from LAION on, who seem very focused on releasing and open-sourcing large models and large datasets. I am sure you know, but they have replicated CLIP. He might be able to put you in contact with them; perhaps that would get the ball rolling on the larger model. Contacts of contacts of contacts...
I would love to hear more about this work, difficulties training it, inspirations, thoughts on ethical issues.
Finally, your work here by itself is enough to get a lot of people interested in scaling it up. LAION got started by doing some cool stuff, starting a Discord, and people chipping in money until they got a couple of sponsors.
How much iron does someone need to bring to the party to help?
A bit of a tricky question. I would need a system with either 4 or 8 GPUs, V100-caliber, with a minimum of 24GB of RAM. I'm sure it could be done on less, but the training time required would start getting pretty untenable.
What's tricky is the storage and network requirements. My dataset consumes ~8TiB of flash storage. I'd need a computer that has the same. More importantly, I'd need to somehow transmit 8TiB of data out of my home network. I guess that'd only take about a week with a good connection. Still, something to consider.
Hey @honestabelink thanks for the continued input.
This looks great; however, the only pre-trained models I could find are patrickvonplaten/wav2vec2-xls-r-phoneme-300m-sv and patrickvonplaten/wav2vec2-xls-r-phoneme-300m-tr. I'm not a linguistics guy, but I'm going to guess that Swedish and Turkish phonemes might not cover the entire English language. Is that right? Does anyone know of a wav2vec2 fine-tuned for English phonemes? Would anyone be willing to take on training one if not? It seems like this could be done with Colab or Vast GPUs pretty easily, and it would save me some time getting this working.
Secondly, is it worth having both a conventional alphabet and a phonetic alphabet in the end?
I think going 100% phonetic would compromise the usability of the model, right? Unless I also built a text<->phoneme converter. I realize that these probably exist, but I am skeptical of their quality. I am fairly certain that some text<->phoneme conversions are context-specific. Please correct me if I am wrong and good ones do exist.
Also consider that I am 99.9% certain that if I simply inserted phonemes into the textual lexicon the autoregressive model is trained on, it would "just work" and learn to perform the conversions for me.
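If it helps, a rough sketch of what "inserting phonemes into the textual lexicon" might look like on the tokenizer side, assuming data/tokenizer.json is a Hugging Face `tokenizers` file and with a purely illustrative symbol list. The model's input embedding would of course also need to grow, which is exactly the retraining concern above:

```python
from tokenizers import Tokenizer

# Add IPA symbols as first-class tokens so phoneme strings survive encoding.
tok = Tokenizer.from_file("data/tokenizer.json")
ipa_symbols = ["ʃ", "ə", "ˈ", "ɔ", "ː", "ɪ", "ŋ", "θ", "ð"]  # illustrative subset
added = tok.add_tokens(ipa_symbols)       # returns how many were actually new
print(f"added {added} phoneme tokens; vocab is now {tok.get_vocab_size()}")
tok.save("data/tokenizer_with_ipa.json")  # hypothetical output path
```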
Before pursuing a larger model, a solution for intonation should be found and integrated. I'd hate to see a large version of this model that isn't controllable. https://github.com/neonbjb/tortoise-tts/issues/10#issuecomment-1113539858 is clever, but something without the need for references would be great too. But as a control scheme, yes, 100%, very valuable.
Completely agreed. I simply didn't think about this when building Tortoise originally. I agree that this is an important feature to have.
This is a bit off-topic, but regarding marketing: I'm also a huge Yannic fan. Some of the things he covers make me believe he trolls /r/machinelearning, so I'm hopeful he saw my release announcement organically. I'd love to go on the show, at the risk of outing myself as a horrible theoretician :).
I'm still not really sure what I want to do with Tortoise long-term. I like tinkering, not maintaining. This isn't to say I will abandon Tortoise, but I'd like to get the community involved in maintaining this to help with the load a bit. I'm considering reaching out to an organization like Eleuther or Laion to see if they are interested in onboarding it to their "active projects". Thoughts?
Wav2Vec2-Large-LV60 fine-tuned for phonetic label output:
https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft
Wav2Vec2-Large-LV60 finetuned on multi-lingual Common Voice This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages. When using the model make sure that your speech input is sampled at 16kHz. Note that the model outputs a string of phonetic labels. A dictionary mapping phonetic labels to words has to be used to map the phonetic output labels to output words. Paper: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition
So trained on Common Voice would mean...
CommonVoice (36 languages, 3.6k hours): Arabic, Basque, Breton, Chinese (CN), Chinese (HK), Chinese (TW), Chuvash, Dhivehi, Dutch, English, Esperanto, Estonian, French, German, Hakh-Chin, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Mongolian, Persian, Portuguese, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Welsh (see also finetuning splits from this paper).
Sounds comprehensive on the surface. Is the model accurate enough for 8TB? 👀
Maybe, https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/README.md
We release 2 models that are finetuned on data from 2 different phonemizers. Although the phonemes are all IPA symbols, there are still subtle differences between the phonemized transcriptions from the 2 phonemizers. Thus, it's better to use the corresponding model, if your data is phonemized by either phonemizer above.
They list an implementation of Simple and Effective Zero-shot Cross-lingual Phoneme Recognition (Xu et al., 2021) as part of their transformer models. So it's looking like yes, maybe?
Fine-tuning starting point / code references: https://github.com/kosuke-kitahara/xlsr-wav2vec2-phoneme-recognition/blob/main/Fine_tuning_XLSR_Wav2Vec2_for_Phoneme_Recognition.ipynb
I guess, with regard to the phonetic transcriptions and phoneme cross-overs, I will have to go around and survey. On a side note for now, I did see https://github.com/mphilli/English-to-IPA, which seems to be just a dictionary lookup.
The ipa_list function returns a list of each word as a list of all its possible transcriptions. It has all the same optional stress_marks and keep_punct parameters as convert.
The isin_cmu function takes a word (or list of words) and checks if it is in the CMU pronouncing dictionary (returns True or False). If a list of words is provided, then True will only be returned if every provided word is in the dictionary.
Sounds like great additions to have access to.
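A usage sketch for that library, going by its README (the pip package name eng-to-ipa is my assumption from the repo):

```python
import eng_to_ipa as ipa

print(ipa.convert("The quick brown fox"))  # dictionary-based IPA transcription
print(ipa.ipa_list("read"))                # every candidate transcription per word
print(ipa.isin_cmu("tortoise"))            # True if the word is in the CMU dictionary
```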
I just don't know what Wav2Vec2 above outputs.
['m ɪ s t ɚ k w ɪ l t ɚ ɹ ɪ z ð ɪ ɐ p ɑː s əl ʌ v ð ə m ɪ d əl k l æ s ᵻ z æ n d w iː ɑːɹ ɡ l æ d t ə w ɛ l k ə m h ɪ z ɡ ɑː s p əl']
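For context, the Hugging Face model card for facebook/wav2vec2-lv-60-espeak-cv-ft shows output like that being produced roughly as follows. This is a sketch based on the card; "sample.wav" is a placeholder path, and the model expects 16 kHz mono audio:

```python
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")

# Load any 16 kHz mono clip; "sample.wav" is a placeholder file name.
waveform, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))  # -> a space-separated string of IPA phonemes
```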
IPA is a new acronym for me 😄
As a final note, please don't feel pressured to expand on this project. Honestly, you've done an incredible amount of work already. 8TB 👀 + 8x3090s for what, probably months. 🏅
This work will live on as a shining example from the open-source community.
Myself, I'll be tinkering around for a bit, so I don't mind implementing some things or spinning up a Colab instance, etc. Feel free to reach me on Discord too, honestabelink#4221, for anything specific.
Final thing: I agree on including the phonemes, and it would work. I am just wondering whether the model would suffer from having to learn both sets of tokens. I don't know, probably not, right? Attention is all you need. Also, a quick survey of TTS models that use phonemes should be done so we know what we are in for.
I think going 100% phonetic would compromise the usability of the model, right?
No. This is how most readers worked before the ML community showed up.
Then, reading a language was just representing its words as dictionary lookups into a symbol table. As a result, old TTS was inherently polylingual. Still, it was super bad at prosody. You're very good at prosody.
I don't know that I think 100% phonetic is necessary, but if you chose to do that, it wouldn't be an impediment; it'd be more like you'd be becoming the backend for frontends like Bri'ish English and Murican Anglish and so forth.
Practically speaking, I don't see how dialects could be handled without phonetics, but with them, it becomes a question of finding source data to get it right.
@neonbjb The idea of using both text and phonemes for training is interesting and, IMO, the best option: text stays the default for simplicity (and very good quality out of the box), while still allowing explicit pronunciation control for certain words when needed. Some words are pronounced differently based on context (e.g. read), and if you have 8TB of data then the model can learn that. You also have weird "words" like haha, uhmmh, heyaah, etc.
I think you can simply add more input tokens to the vocab (note: phoneme 'a' should likely be a different token from grapheme 'a') and fine-tune the existing AR transformer model. Maybe freeze everything except the input token embeddings for the first N steps? During training, maybe swap some percentage of words for their IPA pronunciation using a lexicon or gruut.
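To illustrate the swapping idea, a rough sketch; the markup, the lexicon entries, and the swap rate are all assumptions, and gruut could stand in for the dictionary:

```python
import random

def mix_in_phonemes(text: str, lexicon: dict, p: float = 0.3) -> str:
    """Randomly replace known words with an IPA form so training sees both alphabets."""
    out = []
    for word in text.split():
        key = word.lower().strip(".,;!?")
        if key in lexicon and random.random() < p:
            out.append(f"[ipa:{lexicon[key]}]")  # distinct markup => distinct tokens
        else:
            out.append(word)
    return " ".join(out)

lexicon = {"read": "ɹɛd", "internet": "ˈɪntɚnɛt"}  # illustrative entries only
print(mix_in_phonemes("I read the internet yesterday.", lexicon, p=0.5))
```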
Awesome project btw :)
I have a mild Geordie accent, from the north-east of the UK, and tortoise seems to flip-flop between making me American and making me British, even within a single clip. I imagine it's just the case that I'm not represented in the training data. I was wondering whether it'd be helpful to emphasise the differences in your accent in the conditioning clips - e.g. reading the Wikipedia page for your accent and using that recording as your voice sample.
It's one of those times where it'll take less time to just try it, so I'll give it a try and update this.
Update - It helps a little bit when specific words are included in the voice samples you give it; for example, it is pronouncing the t in water as a glottal stop (wa'er), but it still doesn't conceptualise what my accent is (e.g. room should be pronounced rum).
I don't expect it to know every accent, but at least that's one datapoint :smile:
Yeah, I bet your accent is probably just a bit too far from the training set. I bet that if I fine-tuned the model with your voice, it would work great. I am preparing a write-up on fine-tuning results with Tortoise right now; stay tuned for that.
Edit: forgot to mention that I did receive a message from someone on Reddit today about accent control. They claimed that they were able to get the model to reliably produce a desired accent by carefully picking which conditioning clips they provided to it. They said they would start with a large number of clips, ~15, then permute through them until they got the desired results. Might be worth trying. I haven't done any experiments in this area yet.
then permute through them until they got the desired results
Could that be automated? E.g. would you expect to see lower losses in the output with a good set of samples? I don't really fancy manually inspecting 15! options.
Don't think so. It's a hack to make do with what I've made available. The correct solution is either fine tuning or training a bigger model (which would generalize better).
Can someone who is interested in this feature give me a few examples of English text paired with phonetic strings (in IPA) that TorToiSe doesn't handle well? I would like to use them as test cases.
Fundamentally, any word which is distinctly different in American and British English (or Australian, etc.) is a valid example.
Consider please this highly sensible and relevant sentence:
Either aluminum advertisement clique or leisure missile schedule niche; tomato scone vitamin stance
I'm sure you've heard this before. It's fresh with the kids.
An American will say "EYE thur - uh LEW min em - ad ver TIES ment - CLEEK - or - lee ZHER - miss ill - SKED jill - nitch - toe MAY toe - scow nnn - VIE tuh min - stans"
This is ˈiðər əˈlumənəm ədˈvɜrtəzmənt klik ɔr ˈlɛʒər ˈmɪsəl ˈskɛʤʊl nɪʧ; təˈmeɪˌtoʊ skoʊn ˈvaɪtəmən stæns
In British English, this will be "EEE thur - al YEW min yem - AD vur tis munt - CLICK - or - lezh YURE (or YEWER) - miss ISLE - SHED yule - neesh - toe MAW toe - ska nnn - VEH ta min - staun sss"
This is ˈaɪðər al(j)ʊˈmɪnɪəm ədˈvɜːtɪsmənt kliːk ɔː ˈlɛʒə ˈmɪsaɪl ˈʃɛdjuːl niːʃ; təˈmɑːtəʊ skɒn ˈvɪtəmɪn stɑːns
Running this fascinating, Pulitzer-quality text through your system using the voice "Tom" in high-quality mode almost universally produces the American pronunciations.
For me, this is mostly about being able to force pronunciations in names.
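In case it's useful as-is, here's the above collected into paired test cases (a hypothetical structure; the transcriptions are copied from this comment):

```python
# (text, General American IPA, British IPA) triples for regression testing.
TEST_CASES = [
    (
        "Either aluminum advertisement clique or leisure missile schedule niche; "
        "tomato scone vitamin stance",
        "ˈiðər əˈlumənəm ədˈvɜrtəzmənt klik ɔr ˈlɛʒər ˈmɪsəl ˈskɛʤʊl nɪʧ; "
        "təˈmeɪˌtoʊ skoʊn ˈvaɪtəmən stæns",
        "ˈaɪðər al(j)ʊˈmɪnɪəm ədˈvɜːtɪsmənt kliːk ɔː ˈlɛʒə ˈmɪsaɪl ˈʃɛdjuːl niːʃ; "
        "təˈmɑːtəʊ skɒn ˈvɪtəmɪn stɑːns",
    ),
]
```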
excellent, thank you!
Update on this: I've finished transcribing my dataset with facebook/wav2vec2-lv-60-espeak-cv-ft and have started re-training the autoregressive model with the new dual-purpose lexicon. Good news is it didn't blow up and the loss curves look good.
I expect training will take a week at least, but I should know if this approach is actually sound by the weekend.
I've opened up the wandb for this model if anyone is curious to follow along. This project contains all of my training attempts for the autoregressive model. You'll want to watch the latest runs, titled unified_large_with_phonetic: https://wandb.ai/neonbjb/train_gpt_tts
A couple notes on this based on my own experiences/preferences that you can take or leave:
- This is indeed an incredibly helpful addition to a TTS system, IMO necessary for most real-world use
- Mixed text/phoneme models are preferable to going all-in on one or the other, due to both the error potential in using a g2p (grapheme-to-phoneme) model at runtime and the cumulative errors that will creep in with automatic transcription of the training dataset
- IPA is preferable to ARPABET because it's inherently multilingual. The latter is used a lot in research projects, but it's an English system by design, so it cuts out a lot of future potential. It's easy enough to provide ARPABET to users as a more typeable option via a simple transformation on input text.
Looks like this made it about 1/3rd of the way through training and was stopped? Any plans to continue?
Fwiw, lack of effective pronunciation control is a major problem across lots of text-to-speech engines, and this enhancement would likely mean outperforming all commercial products I've tried.
Thanks!
Unfortunately, training from the existing model was not working. The loss curves looked good but the model was producing gibberish in inference. I'm not quite sure what it was learning. I need to dig in with some debugging or re-think the approach, but haven't found the time.
Full disclosure: it's likely that when I get to this, it'll be a full re-train with a cleaner dataset, leveraging some lessons learned. It'll likely be a commercial product built in cooperation with an existing TTS company.
Love it - thanks for the update. I believe this would be a differentiating feature for sure.
Do you have a commercial partner in mind? I've tried many, but of course not all, products.
Yes, I am working with a company, but I cannot announce anything right now. I'll drop a post in here when I can.
It'll likely be a commercial product built in cooperation with an existing TTS company.
😢