StyleTTS2
Added importable module
Fixes #51
You need to provide your own phonemizer (because of this), and can use it like so:
```python
from styletts2 import TTS
import sounddevice as sd
import phonemizer

tts = TTS.load_model(
    config_path="hf://yl4579/StyleTTS2-LibriTTS/Models/LibriTTS/config.yml",
    checkpoint_path="hf://yl4579/StyleTTS2-LibriTTS/Models/LibriTTS/epochs_2nd_00020.pth"
)

es_phonemizer = phonemizer.backend.EspeakBackend(
    language='en-us',
    preserve_punctuation=True,
    with_stress=True
)

style = tts.compute_style('../tts-server/tts_server/voices/en-f-1.wav')

wav, _ = tts.inference(
    "This is a text! Hello world! How are you? What's your name?",
    style,
    phonemizer=es_phonemizer,
    alpha=0.3,
    beta=0.7,
    diffusion_steps=10,
    embedding_scale=2)

sd.play(wav, 24000)
sd.wait()
```
See https://github.com/yl4579/StyleTTS2/pull/78#issuecomment-1826117745; it's the same GPL license problem.
Phonemizer was already included in the project. I can remove the phonemizer dependency and just let people pass their own phonemizers.
Ah your usage of phonemizer is "only to run the demo":
https://github.com/yl4579/StyleTTS2/blob/17c6b6120ca99b193ed500fa8c6dc1820edccff8/README.md?plain=1#L39
Which I guess makes sense in this case :)
Also, I tried @fakerybakery's idea of using DeepPhonemizer, but it's not nearly as good as espeak.
I changed it so a phonemizer needs to be explicitly loaded:

```python
wav, _ = tts.inference(
    "This is a text! Hello world! How are you? What's your name?",
    style,
    phonemizer=es_phonemizer,
    alpha=0.3,
    beta=0.7,
    diffusion_steps=10,
    embedding_scale=2)
```
Hi @lxe, my fork supports importing. I think the author @yl4579 mentioned it would be better to keep a separate GPL'd fork.
https://github.com/NeuralVox/StyleTTS2
I will try to keep it updated with the main repo
@fakerybakery @lxe Have you checked https://github.com/lingjzhu/CharsiuG2P?
Hmm! Looks interesting. Basically a T5 model trained on phonemes. I'll try it out in the upcoming days
Seems like there are some issues with the tiny model; I'll try out the larger models later.

Input text: Hello world!
CharsiuG2P: hɛlowoɐ̯ldˈeslo
Phonemizer: həloʊ wɜːld
Yup, I've been checking Charsiu and Text2PhonemeSequence.
They don't handle stress well and have other artifacts.
Opportunity for a new open source project: phonemizer alternative that supports many languages and is compatible with espeak!
Coqui ships an MPL-2.0 / commercial product, yet uses espeak-ng like this?
Yeah, they're probably violating the license (IANAL). Does anyone know C well enough to reverse engineer espeak?
Gruut is a bust too: it over-stresses things and isn't nearly as accurate as espeak.
Sort of funny. MPL is compatible with GPL but not the other way around.
Yeah. But training a T5 model on phonemizer output doesn't seem too hard: get a text dataset in the target language, phonemize it using phonemizer, and train the model on the resulting pairs. The main issue is that it's expensive. @yl4579, if a multilingual phonemizer dataset were available, would the compute you have access to be enough to train a phonemizer T5 model?
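The dataset-generation step could be sketched like this, assuming the `phonemizer` package and an espeak backend are installed; `build_g2p_pairs` is a made-up helper name, not part of any repo in this thread:

```python
def build_g2p_pairs(lines, language="en-us"):
    """Phonemize a list of sentences to build (text, phonemes) training
    pairs for a seq2seq G2P model such as T5."""
    try:
        from phonemizer import phonemize  # GPL-side dependency, used offline only
        phonemes = phonemize(
            lines,
            language=language,
            backend="espeak",
            preserve_punctuation=True,
            with_stress=True,
        )
    except Exception:
        return []  # phonemizer or espeak not available; skip dataset generation
    return list(zip(lines, phonemes))
```

Since the GPL-covered tooling only runs offline to produce the dataset, the trained model itself could ship under a permissive license.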
The way Coqui TTS does it is by expecting an espeak-ng binary to be available. That actually doesn't seem to violate the GPL.
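A minimal sketch of that approach: shell out to the espeak-ng CLI in a separate process instead of linking against it. The helper name is made up; the flags are standard espeak-ng options:

```python
import shutil
import subprocess

def espeak_phonemize(text, voice="en-us"):
    """Get IPA phonemes by invoking the espeak-ng binary as a separate
    process, so no GPL code is linked into this program."""
    exe = shutil.which("espeak-ng") or shutil.which("espeak")
    if exe is None:
        return None  # binary not on PATH
    result = subprocess.run(
        [exe, "-q", "--ipa", "-v", voice, text],  # -q: no audio, --ipa: print IPA
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```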
Hmm, does phonemizer do the same thing? Also, we could always write a script that starts a phonemizer server on localhost and have the TTS code call that API.
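The localhost-server idea could look roughly like this, keeping the GPL-covered backend in its own process behind a tiny HTTP API (the port, route, and JSON field names are arbitrary choices here, and espeak-ng is used as a stand-in backend):

```python
import json
import shutil
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

def phonemize_text(text):
    """Produce phonemes inside the server process, isolated from clients."""
    exe = shutil.which("espeak-ng") or shutil.which("espeak")
    if exe is None:
        return ""  # no backend installed
    return subprocess.run(
        [exe, "-q", "--ipa", text],
        capture_output=True, text=True,
    ).stdout.strip()

class G2PHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Clients POST {"text": "..."} and get back {"phonemes": "..."}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"phonemes": phonemize_text(payload["text"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To run the server:
#   HTTPServer(("127.0.0.1", 8570), G2PHandler).serve_forever()
```

Whether process isolation over HTTP actually satisfies the GPL is a separate (legal) question, but it's the same boundary Coqui relies on with the binary.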
Relevant discussions:
https://github.com/rhasspy/piper/issues/93
https://github.com/espeak-ng/espeak-ng/issues/908
If there's a decent enough (or at least sometimes usable) phonemizer alternative, I can integrate it into my TTS web UI. Since I ship full install scripts, the "install phonemizer yourself" approach isn't really viable.
Use gruut; see the styletts2 pip package on PyPI.