🐸 TTS roadmap

Open erogol opened this issue 3 years ago • 49 comments

These are the main dev plans for :frog: TTS.

If you want to contribute to :frog: TTS and don't know where to start, you can pick one here and start with our Contribution Guideline. We're also always here to help.

Feel free to pick one or suggest a new one.

Contributions are always welcome :muscle: .

v0.1.0 Milestones

  • [x] Better model config handling #21
  • [x] TTS recipes for public datasets.
  • [x] TTS trainer API to unify all the model training scripts.
  • [x] TTS, Vocoder and SpeakerEncoder model abstractions and APIs.
  • [x] Documentation for
    • [x] Implementing a new model using :frog: TTS.
    • [x] Training a model on a new dataset from gecko.
    • [x] Using Synthesizer interface on CLI or Server.
    • [x] Extracting Spectrograms for Vocoder training.
    • [x] Contributing a new pre-trained :frog: TTS model.
    • [x] Explanation for Model config parameters.

v0.2.0 Milestones

  • [x] Grapheme 2 Phoneme in-house conversion. (Thx to gruut 👍 )
  • [x] Implement VITS model.

v0.3.0 Milestones

  • [x] Implement generic ForwardTTS API.
  • [x] Implement Fast Speech model.
  • [x] Implement Fast Pitch model.

v0.4.0 Milestones

  • [x] Trainer API v2 - join the discussion
  • [x] Multi-speaker VCTK recipes for all the TTS.tts models.

v0.5.0 Milestones

  • [x] Support for multi-lingual models
  • [x] YourTTS release 🚀

v0.6.0 Milestones

  • [x] Add ESpeak support
  • [x] New Tokenizer and Phonemizer APIs #937
  • [x] New Model API #1078
  • [x] Splitting the trainer as a separate repo 👟Trainer
  • [x] Update VITS model API
  • [x] Gradient accumulation. #560 (in 👟)

v0.7.0 Milestones

  • [x] Implement Capacitron 👑 @a-froghyar 👑 @WeberJulian
  • [x] Release pretrained Capacitron

v0.8.0 Milestones

  • [x] Separate numpy transforms
  • [x] Better data sampling for VITS
  • [x] New Thorsten DE models 👑 @thorstenMueller

🏃‍♀️ Milestones along the way

  • [ ] Implement End-to-end training API for ForwardTTS models with a vocoder. #1510
  • [ ] Implement a Python voice synthesis API.
  • [ ] Inject phonemes to the input text at inference. #1452
  • [ ] AdaSpeech1/2 https://arxiv.org/pdf/2104.09715 and https://arxiv.org/abs/2103.00993
  • [ ] Let the user pass a custom text cleaner function.
  • [ ] Refactor the text cleaners for a more flexible and transparent API.
  • [ ] Implement HifiGAN2 (not the vocoder)
  • [ ] Implement emotion and style adaptation.
  • [ ] Implement FastSpeech2 (https://arxiv.org/abs/2006.04558).
  • [ ] AutoTTS 🤖 (👑 @loganhart420)
  • [ ] Watermarking TTS outputs to sign against DeepFakes.
  • [ ] Implement SSML v0.0.1
  • [ ] ONNX and TorchScript model exports.
  • [ ] TensorFlow run-time for training models.
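
The "custom text cleaner function" item above could look roughly like the sketch below. This is only an illustration under stated assumptions: `SimpleTokenizer`, `default_cleaner`, and the `text_cleaner` parameter are hypothetical names invented here and are not the :frog: TTS tokenizer API.

```python
import re
from typing import Callable, List, Optional


def default_cleaner(text: str) -> str:
    """Stand-in for a built-in cleaner: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


class SimpleTokenizer:
    """Hypothetical character tokenizer that accepts a user-supplied cleaner."""

    def __init__(
        self,
        characters: str,
        text_cleaner: Optional[Callable[[str], str]] = None,
    ) -> None:
        # Map every known character to an integer ID.
        self.char_to_id = {ch: i for i, ch in enumerate(characters)}
        # Fall back to the default cleaner when the user passes nothing.
        self.text_cleaner = text_cleaner or default_cleaner

    def text_to_ids(self, text: str) -> List[int]:
        cleaned = self.text_cleaner(text)
        # Characters outside the vocabulary are silently dropped.
        return [self.char_to_id[ch] for ch in cleaned if ch in self.char_to_id]


def expand_ampersand(text: str) -> str:
    """Example user cleaner: spell out '&' and then apply the default rules."""
    return default_cleaner(text.replace("&", " and "))


VOCAB = "abcdefghijklmnopqrstuvwxyz "
default_ids = SimpleTokenizer(VOCAB).text_to_ids("Hi  There")
custom_ids = SimpleTokenizer(VOCAB, text_cleaner=expand_ampersand).text_to_ids("A & B")
```

Passing the cleaner as a plain callable keeps the tokenizer unaware of language-specific rules, so users can compose their own cleaning steps without touching library code.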

🤖 New TTS models

  • [x] AlignTTS (@erogol)
  • [x] HiFiGAN (#16 :crown: @rishikksh20 and @erogol)
  • [x] UnivNet Vocoder ( :crown: @rishikksh20)
  • [x] VITS paper
  • [x] FastPitch source
  • [x] Alignment Network paper
  • [x] End2End TTS combining aligner + tts + vocoder.
  • [x] Multi-Lingual TTS (#11 :crown: @WeberJulian )
  • [ ] ParallelTacotron paper (open for contribution)
  • [ ] Efficient TTS paper (open for contribution)
  • [ ] Gaussian length regulator from https://arxiv.org/pdf/2010.04301.pdf (open for contribution)
  • [ ] LightSpeech from https://arxiv.org/pdf/2102.04040.pdf (open for contribution)
  • [ ] AdaSpeech1/2 https://arxiv.org/pdf/2104.09715 and https://arxiv.org/abs/2103.00993

erogol avatar Mar 13 '21 14:03 erogol

great project! Excited to see this growing!

lucascassiano avatar Mar 22 '21 22:03 lucascassiano

I'm learning the code/API and performing experiments. I hope to contribute soon.

I'm also wondering if I can donate (money) to Coqui?

AndrewBarfield avatar Apr 17 '21 21:04 AndrewBarfield

> I'm learning the code/API and performing experiments. I hope to contribute soon.
>
> I'm also wondering if I can donate (money) to Coqui?

Wow! Thanks! Humbling.

We were setting up GitHub sponsors, but the tax implications were onerous.

We're currently exploring Patreon. So stay tuned!

kdavis-coqui avatar Apr 18 '21 08:04 kdavis-coqui

@erogol Thanks for sharing the plans!

Do you have any thoughts on (or need help with) simplifying the dependencies a bit? I'm thinking that if TTS is used as a library installed via pip, it might be nice to remove visualisation dependencies that are only used in notebooks, remove test/dev dependencies, and move e.g. tensorflow into extras to reduce the footprint. Personally, I would love to use this as a dependency rather than maintaining my own fork.
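
A minimal sketch of how heavy dependencies could be deferred until they are actually needed, assuming they were moved into a pip extra. The helper `require_optional` and the extra name `notebooks` are hypothetical and chosen only for illustration; they are not part of the current package.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The extra name is an assumption; it only exists if the package defines it.
        raise ImportError(
            f"'{module_name}' is an optional dependency. "
            f"Install it with: pip install 'TTS[{extra}]'"
        ) from exc


def plot_spectrogram(spectrogram):
    """Plot a 2D array; matplotlib is imported only when this is called."""
    plt = require_optional("matplotlib.pyplot", extra="notebooks")
    fig, ax = plt.subplots()
    ax.imshow(spectrogram, origin="lower", aspect="auto")
    return fig
```

With this pattern, importing the library stays cheap, and users who never plot anything do not need the visualisation stack installed.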

agrinh avatar Apr 26 '21 11:04 agrinh

@agrinh Why do you need to keep your own fork exactly? It'd be better to expand the conversation on gitter if you like.

erogol avatar Apr 26 '21 11:04 erogol

> @agrinh Why do you need to keep your own fork exactly? It'd be better to expand the conversation on gitter if you like.

Wow, thanks for the super fast reply. Sure, we can move the discussion to gitter.

agrinh avatar Apr 26 '21 11:04 agrinh

Please add DC-TTS to the list of models.

A DC-TTS implementation with MIT-licensed code is available here. Paper: Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. @erogol

Sadam1195 avatar May 06 '21 00:05 Sadam1195

What were you thinking about the "TensorFlow run-time for training models"? Like giving the user the option of using TensorFlow or PyTorch? I wouldn't mind taking a stab at the TensorFlow part.

will-rice avatar Aug 20 '21 23:08 will-rice

@will-rice the plan is to mirror what we have in torch to TF as much as possible. It'd be great if you initiate the work.

erogol avatar Aug 23 '21 11:08 erogol

Are you guys planning to develop some expressive TTS architectures? I'm currently studying this topic and planning to implement some of them based on Coqui. Some of them just control the latent space using GST (Kwon et al. 2020) or RE (Sorin et al. 2020), and others actually change the architecture by adding a VAE, normalizing flows, and gradient reversal.

lucashueda avatar Aug 30 '21 12:08 lucashueda

@lucashueda Capacitron VAE: https://github.com/coqui-ai/TTS/pull/510

a-froghyar avatar Aug 30 '21 12:08 a-froghyar

> @lucashueda Capacitron VAE: #510

Oh nice, hope to see Capacitron integrated soon. So maybe in the future I'll be able to contribute some other expressive architectures.

lucashueda avatar Aug 30 '21 13:08 lucashueda

@erogol Looking forward to new end-to-end models being implemented, specifically Efficient-TTS! If the paper is accurate, it should blow most two-stage configurations out of the water: it seems to have a higher MOS than tacotron2+hifigan while also seeming significantly faster than glowtts+fastest vocoder. I have not seen a single repo replicating the EFTS-Wav architecture described in the paper, which was released 10 months ago. It would be amazing to see it in Coqui first!

BillyBobQuebec avatar Sep 18 '21 21:09 BillyBobQuebec

@BillyBobQuebec I don't think I will implement these models anytime soon. But as they stand, contributions are welcome

erogol avatar Sep 18 '21 23:09 erogol

@BillyBobQuebec but you can try VITS which is close to what you're describing :)

WeberJulian avatar Sep 18 '21 23:09 WeberJulian

> @BillyBobQuebec but you can try VITS which is close to what you're describing :)

Agreed, I'm actually trying VITS right now. Unfortunately, I have some issues training with the Coqui implementation. I posted an issue about the bug today and hope I can get it resolved.

BillyBobQuebec avatar Sep 18 '21 23:09 BillyBobQuebec

Hi there! Thanks for your great work! I'm looking forward to training YourTTS on other languages. Will the training and fine-tuning code for YourTTS be published soon? I would be very grateful if you could give me an approximate time. Have a nice day :-D

hemath1001 avatar Feb 02 '22 06:02 hemath1001

Hello, thanks for the great work! I'm a fan of Coqui TTS.

I'm porting parts of the project to Rust for the following reasons.

  • Predictable Performance
  • Static-typed Metadata & Model Management
  • Multithreaded Server Implementation
  • I just love Rust

The voice conversion (VC) from YourTTS has been successfully implemented. For this purpose, an example of saving/loading a pretrained VITS model has been added to the repo. I'm posting it on this milestones issue because I think my work can be helpful to others :)

  • Repository (RusTTS): https://github.com/kerryeon/rustts

HoKim98 avatar Feb 23 '22 03:02 HoKim98

@kerryeon great work!! Thanks for sharing!

erogol avatar Feb 23 '22 10:02 erogol

Are there any plans to port the coqui-ai engine to Android? TTS on Android is very robotic (espeak, rhvoice, festival lite).

paolo-caroni avatar Feb 27 '22 10:02 paolo-caroni

No immediate plans on that

erogol avatar Mar 01 '22 10:03 erogol

Thumbs up for planning ONNX support. Hope it gets prioritized more!

Darth-Carrotpie avatar Mar 30 '22 10:03 Darth-Carrotpie

@Darth-Carrotpie what is your use-case of ONNX? (Just want to get some feedback)

erogol avatar Apr 01 '22 08:04 erogol

> @Darth-Carrotpie what is your use-case of ONNX? (Just want to get some feedback)

Personally, for me it sounds like a good way to develop Windows-native TTS applications without needing a Python runtime and/or big dependencies like pytorch.

I tried exporting the VITS model to ONNX before, but didn't succeed. There are also other obstacles besides executing the model, like phonemization. ^^

Currently I am using pythonnet to embed the required python functions directly in my C# code. For Python I use the embedded version to make the App distributable.

lexkoro avatar Apr 05 '22 12:04 lexkoro

@erogol I am trying to run models in Unity. Its environment is C#, .NET Standard 2.1. Having a universal model format also means that, in the long run, I can run models in an OS-agnostic manner. Of course, things like tokenization and phonemization are additional hurdles, but if there are open-source examples it's quite doable. For models needing tokenizers I've been using BlingFire successfully, so I reckon there are similar phonemizer helpers or libraries for languages besides Python, including C#. Edit: things that embed Python into C#, like pythonnet, are convenient, though quite slow. In my case, I have multiple models loaded and running at the same time (~10), so needless interpreter overhead can become a critical bottleneck. Plus, it might add unforeseen debugging issues.

Darth-Carrotpie avatar Apr 07 '22 07:04 Darth-Carrotpie

@Darth-Carrotpie does "run in Unity" mean in the code, or integrating it into the Unity editor?

Also, it's better to move this to a separate post under the Discussions.

erogol avatar Apr 07 '22 12:04 erogol

> @Darth-Carrotpie does "run in Unity" mean in the code, or integrating it into the Unity editor?
>
> Also, it's better to move this to a separate post under the Discussions.

Created a topic on ONNX at Discussions: https://github.com/coqui-ai/TTS/discussions/1479

Darth-Carrotpie avatar Apr 08 '22 09:04 Darth-Carrotpie

Is there a Flutter package for using this TTS library? It might be an easy way to get this into real-world applications.

I am also very new to development but would like to contribute to this project. Can I work under someone?

desh-woes avatar Apr 21 '22 14:04 desh-woes

@desh-woes there is no flutter package, unfortunately.

Can you DM me on Gitter or Element (our chat rooms) if you're willing to work on a particular thing?

erogol avatar Apr 27 '22 08:04 erogol

How can I train a model using word embeddings as input?

omkarade avatar Jul 03 '22 06:07 omkarade