WhisperSpeech
multilanguage support
Will it support Mandarin?
Hey, great question. Does Whisper work for Mandarin? I found https://github.com/openai/whisper/discussions/25 but it seems inconclusive to me.
I'll test today how Whisper semantic tokens from an English only model behave when cloning speech in a different language.
Whisper STT supports Mandarin; I don't know about TTS.
I think TTS would be a bit harder.
We plan to train another quantized semantic token model based on the multilingual Whisper medium model soon. Medium seems like a good quality/speed tradeoff that should improve the quality a lot.
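For readers unfamiliar with what a "quantized semantic token model" means here: the general idea (common to systems like this) is to turn continuous Whisper encoder embeddings into a small discrete vocabulary, e.g. with vector quantization. This is only an illustrative sketch using plain k-means in NumPy, not the project's actual quantization code; `train_codebook` and `to_semantic_tokens` are hypothetical names.

```python
import numpy as np

def train_codebook(embeddings, n_tokens=512, n_iters=20, seed=0):
    """Toy k-means: cluster per-frame encoder embeddings into a codebook.

    embeddings: array of shape (n_frames, dim).
    Returns centroids of shape (n_tokens, dim).
    """
    rng = np.random.default_rng(seed)
    # initialize centroids from random frames (fancy indexing copies)
    centroids = embeddings[rng.choice(len(embeddings), n_tokens, replace=False)]
    for _ in range(n_iters):
        # assign each frame to its nearest centroid (squared L2)
        dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # move each centroid to the mean of its assigned frames
        for k in range(n_tokens):
            members = embeddings[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids

def to_semantic_tokens(embeddings, centroids):
    """Map each frame embedding to the index of its nearest centroid."""
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(1)
```

A TTS model can then be trained to predict acoustic detail conditioned on these discrete token sequences instead of raw audio features.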
If we had a good (speech-only) dataset for other languages, we could add them in to get multilingual semantic tokens. This would open up a path to train a full multilingual TTS model.
@jpc multilingual TTS is an ambitious goal. For Mandarin, TBH, there is no very good open dataset. Biaobei (Baker) could be used as an experiment.
The demos in the readme are all trained on around 1000 hours, so we may be able to get something usable with this amount of data (and multiple languages may benefit from each other, like in Whisper).
To add a language, we mainly have to make sure Whisper works well on it, since this is what we use for back translation. I can do this verification for English and Polish. For other languages we need some help.
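A simple way to do that verification is to transcribe some held-out audio in the target language and measure word error rate (WER) against reference transcripts. A minimal, self-contained WER helper (standard Levenshtein distance over words; this is a generic metric, not code from this repo):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Feeding Whisper transcriptions of a few hundred clips through this against their references gives a quick quality signal for a candidate language. (Note that WER is a rough metric for languages without whitespace word boundaries, like Mandarin, where character error rate is more common.)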
If you're looking for help validating French, I'd be glad to make this very small contribution.
Hey, we now have an English + Polish model so the architecture is validated for other languages. Right now it looks like we need a few hundred hours of speech to fully support a new language although the number will probably drop the more languages we add.
We'll make a plan to help people contribute support for other languages.
A multi-part question regarding training a language dialect:
- Is there a way to skip OpenAI Whisper encoder step to generate embeddings?
- I have plenty of transcribed text but for dialects which are obviously not present in the Whisper model.
- Is there a way to still train your model for another language dialect?
- Is a 4090 enough to train the model?
- Is there a recommended size of dataset to achieve good results with a new language?
I currently have some voice data. How should I start training a new language model? I've read some documents in the readme and nbs, but couldn't find the training steps.
This project looks really promising and I like the quality of the generated speech audios 😄
Looking forward to this one!
I would like to add support for Hebrew too. OpenAI's API already supports TTS for Hebrew. The only problem is that the speaker's accent is American instead of a Hebrew accent, but that's still usable!
- Can I use recordings of different speakers as training data and get a better final voice, like you already did?
- How many hours of recordings should I gather?
- Do you have any information on how I should train it and eventually create a PR for this repo?