dc_tts_GUI

GUI Wrapper for 'A TensorFlow Implementation of DC-TTS: yet another text-to-speech model'

Overview

A machine learning based text-to-speech program with a user-friendly GUI. The target audience includes Twitch streamers and content creators looking for an open-source TTS program. The aim of this software is to make TTS synthesis accessible offline (no coding experience, GPU, or Colab required) in a portable exe.

Features

  • Reads donations from Stream Elements automatically (see the sketch after this list)
  • PyQt5 wrapper for dc_tts
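
The donation reader can be thought of as a simple polling loop against the StreamElements API. Below is a minimal sketch, assuming a channel ID and JWT token from the StreamElements dashboard; the endpoint and response fields follow the public kappa/v2 API and are not necessarily what gui.py does internally.

# Minimal polling sketch for StreamElements donations (illustrative only).
# CHANNEL_ID and JWT_TOKEN are placeholders; the response field names
# ("docs", "_id", "donation", "message") are assumptions about the API shape.
import time
import requests

CHANNEL_ID = "your_channel_id"   # placeholder
JWT_TOKEN = "your_jwt_token"     # placeholder

URL = "https://api.streamelements.com/kappa/v2/tips/" + CHANNEL_ID
HEADERS = {"Authorization": "Bearer " + JWT_TOKEN}

seen = set()
while True:
    resp = requests.get(URL, headers=HEADERS,
                        params={"limit": 5, "sort": "-createdAt"})
    resp.raise_for_status()
    for tip in resp.json().get("docs", []):
        if tip["_id"] in seen:
            continue
        seen.add(tip["_id"])
        msg = tip.get("donation", {}).get("message", "")
        if msg:
            print("New donation message:", msg)  # hand off to TTS here
    time.sleep(10)  # poll every 10 seconds

In the GUI, each new donation message would be handed to the synthesizer instead of printed.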

Download link

A portable executable can be found on the Releases page, or directly here. Download a pretrained model separately (from below) to start playing with text-to-speech.

Warning: the portable executable runs on the CPU, which is more than 10x slower than running on a GPU. I might consider other, faster models for CPU inference in the future.

Pretrained Model

A pretrained model for dc_tts is available from Kyubyong's repo or directly here. Kyubyong also provides pretrained models for 10 different languages from the CSS10 dataset. Of course, you are encouraged to try building your own custom voices to use with this GUI.

Todo

  • [x] Pygame mixer instead of sounddevice
  • [x] PyQt threading
  • [x] Package into portable executable (cx_freeze/pyinstaller)
  • [ ] PyQt volume control instead of pygame
  • [ ] Websockets
  • [ ] Add neural vocoder (Waveglow?) instead of griffin-lim
  • [ ] Phoneme support with seq2seq model or espeak
  • [ ] Make a tutorial page
  • [ ] Add streamlabs support

Building from source

Requirements

  • Python >=3.7
  • librosa
  • numpy
  • PyQt5==5.15.0
  • requests
  • tensorflow>=1.13.0,<2.0.0
  • tqdm
  • matplotlib
  • scipy
  • num2words
  • pygame

To Run

python gui.py

To train custom voices (transfer learning)

The training steps are slightly modified from Kyubyong's to fix #11. The training data follows the LJ Speech dataset format, and the expected folder structure is

.
└── data
    ├── wavs
    │   ├── data1.wav
    │   └── data2.wav
    └── transcript.csv
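
For reference, transcript.csv follows the LJ Speech metadata convention: one pipe-delimited line per clip, containing the wav file name (without extension), the raw transcript, and the normalized transcript, with no header row. A hypothetical example matching the wav names above:

data1|Hello and welcome to the stream!|Hello and welcome to the stream!
data2|It is 5 o'clock somewhere.|It is five o'clock somewhere.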

Steps

  1. Use 22050 Hz, 16-bit signed PCM wav files. Other formats are untested (a conversion sketch is given after this list).
  2. Create a CSV transcript following the LJ Speech metadata convention (see the example above) and save it in the folder structure shown above.
  3. Extract the two folders in the pretrained model, then edit hyperparams.py to point to their location (see the excerpt after this list).
  4. Run python prepro.py
  5. Run python train.py 1 to train Text2Mel
  6. Run python train.py 2 to train SSRN

And you're done! You can load the model using the GUI to perform synthesis.
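
For step 1, if your recordings are not already 22050 Hz, 16-bit PCM, here is a quick conversion sketch using librosa and scipy (both already in the requirements); the source directory is a placeholder:

# Convert arbitrary wav files to 22050 Hz, 16-bit signed PCM (sketch).
import glob
import os

import librosa
import numpy as np
from scipy.io import wavfile

SRC_DIR = "raw_wavs"    # placeholder: your original recordings
DST_DIR = "data/wavs"   # matches the folder structure above
os.makedirs(DST_DIR, exist_ok=True)

for path in glob.glob(os.path.join(SRC_DIR, "*.wav")):
    audio, _ = librosa.load(path, sr=22050)  # resample to 22050 Hz mono floats
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)  # to 16-bit PCM
    wavfile.write(os.path.join(DST_DIR, os.path.basename(path)), 22050, pcm)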
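
For step 3, the relevant edits in hyperparams.py look roughly like the excerpt below. The attribute names (data, logdir) come from Kyubyong's dc_tts; the paths are placeholders for wherever you put the training data and the extracted pretrained checkpoints:

# Relevant excerpt of hyperparams.py (attribute names follow Kyubyong's
# dc_tts; the paths below are placeholders for your own setup).
class Hyperparams:
    # ... other hyperparameters left unchanged ...
    data = "data"             # folder containing wavs/ and transcript.csv
    logdir = "logdir/custom"  # parent folder of the two extracted checkpoint
                              # folders (Text2Mel and SSRN), so training
                              # resumes from the pretrained weights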

License

  • dc_tts: Apache License v2.0

Notes