Glow_TTS

An implementation of the GlowTTS model. Several modes are added: speaker embedding, prosody encoder (GST), and gradient reversal.

Multispeaker GlowTTS

Requirements

  • torch >= 1.5.1

  • tensorboardX >= 2.0

  • librosa >= 0.7.2

  • matplotlib >= 3.1.3

  • Optional, for viewing the loss flow

    • tensorboard >= 2.2.2
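
The requirements above can be installed with pip; a minimal example, assuming a Python 3 environment (the version pins follow the list above):

pip install "torch>=1.5.1" "tensorboardX>=2.0" "librosa>=0.7.2" "matplotlib>=3.1.3" "tensorboard>=2.2.2"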

Structure

Training and inference structure diagrams are provided for each of the following modes:

  • Vanilla mode (Single speaker GlowTTS)
  • Speaker embedding mode
  • Prosody encoding mode (GST GlowTTS)
  • Gradient reversal mode (Voice cloning GlowTTS - failed)

Used dataset

  • The currently uploaded code is compatible with the following datasets.
  • An O mark in the Single or Multi column indicates that the dataset was actually used for the uploaded results in that mode.

Single  Multi  Dataset     Dataset address
O       O      LJSpeech    https://keithito.com/LJ-Speech-Dataset/
X       X      BC2013      http://www.cstr.ed.ac.uk/projects/blizzard/
X       O      CMU Arctic  http://www.festvox.org/cmu_arctic/index.html
X       O      VCTK        https://datashare.is.ed.ac.uk/handle/10283/2651
X       X      LibriTTS    https://openslr.org/60/

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameters.yaml' according to your environment.

  • Sound

    • Setting basic sound parameters.
    • Some parameters, like pitch, are not used in the current code. These are for future work.
  • Use_Cython_Alignment

    • Setting which implementation of monotonic alignment search to use (see the alignment sketch after this list).
    • If true, the cython implementation from the official code will be used.
    • If false, the python implementation will be used.
    • I recommend using the cython implementation because of its speed.
      • However, to use the cython implementation, you must compile it before running.
      • Please refer to the following: https://github.com/jaywalnut310/glow-tts#2-pre-requisites
  • Encoder

    • Setting the encoder parameters
  • Decoder

    • Setting the glow decoder parameters.
  • WaveNet

    • Setting the parameters of the vocoder.
    • This implementation uses a pre-trained Parallel WaveGAN model.
      • https://github.com/CODEJIN/PWGAN_Torch
    • If the checkpoint path is null, the model does not export wav files.
    • If the checkpoint path is not null, all parameters must match the pre-trained Parallel WaveGAN model.
  • Speaker_Embedding

    • Setting the speaker embedding generation method (see the embedding sketch after this list).
    • In Type, you can select null, 'LUT', or 'GE2E':
      • null: No speaker embedding. Single speaker version.
      • LUT: The model will generate a lookup table over the speakers.
      • GE2E: The model will use d-vectors generated by a pre-trained GE2E model.
  • Token path

    • Setting the path of the token-to-index dict.
    • The pattern generator creates this file.
  • Train

    • Setting the parameters of training.
  • Inference_Batch_Size

    • Setting the batch size used at inference.
    • If null, it will be the same as Train/Batch_Size.
  • Inference_Path

    • Setting the inference path
  • Checkpoint_Path

    • Setting the checkpoint path
  • Log_Path

    • Setting the tensorboard log path
  • Use_Mixed_Precision

    • Setting mixed precision.
    • To use it, Nvidia apex must be installed in the environment (see the mixed precision sketch after this list).
    • With some preprocessing hyper parameter settings, a loss overflow problem occurs.
  • Device

    • Setting which GPU device is used in a multi-GPU environment.
    • If using only the CPU, set '-1'.
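
For reference, the monotonic alignment search selected by Use_Cython_Alignment is a dynamic program over the token-frame log-likelihood matrix: every mel frame is assigned to exactly one token, and the token index never decreases along time. The following is a minimal pure-python sketch of that idea, not the repo's actual code; the function name and shapes are illustrative, and the cython module referenced above accelerates exactly this kind of nested loop.

import numpy as np

def monotonic_alignment_search(log_p):
    # log_p: [num_tokens, num_frames] log-likelihood of each mel frame
    # under each token's prior; assumes num_tokens <= num_frames.
    # Returns a hard 0/1 alignment of the same shape.
    num_tokens, num_frames = log_p.shape
    neg_inf = -1e9
    # Q[j, i]: best cumulative score of any monotonic path that ends
    # with frame i assigned to token j.
    Q = np.full((num_tokens, num_frames), neg_inf)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, num_frames):
        for j in range(min(i + 1, num_tokens)):
            stay = Q[j, i - 1]                               # same token
            advance = Q[j - 1, i - 1] if j > 0 else neg_inf  # next token
            Q[j, i] = log_p[j, i] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    alignment = np.zeros_like(log_p)
    j = num_tokens - 1
    for i in range(num_frames - 1, -1, -1):
        alignment[j, i] = 1.0
        if i > 0 and j > 0 and Q[j - 1, i - 1] >= Q[j, i - 1]:
            j -= 1
    return alignment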
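
The difference between the two Speaker_Embedding types is only where the per-speaker vector comes from; the rest of the model consumes it the same way. A minimal sketch with hypothetical names (the repo's actual modules and sizes differ):

import torch

class SpeakerConditioning(torch.nn.Module):
    # 'LUT': a trainable lookup table with one vector per training speaker.
    # 'GE2E': d-vectors produced by a frozen, pre-trained GE2E speaker
    # verification model, projected here to the embedding size.
    def __init__(self, num_speakers, embedding_size, d_vector_size=256):
        super().__init__()
        self.lut = torch.nn.Embedding(num_speakers, embedding_size)
        self.ge2e_projection = torch.nn.Linear(d_vector_size, embedding_size)

    def forward(self, speaker_ids=None, d_vectors=None):
        if speaker_ids is not None:
            return self.lut(speaker_ids)        # [batch, embedding_size]
        return self.ge2e_projection(d_vectors)  # [batch, embedding_size]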
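
For Use_Mixed_Precision, apex's amp wraps the model and optimizer and dynamically scales the loss so fp16 gradients do not underflow; overflowing steps are skipped by the scaler, which relates to the overflow note above. A minimal sketch with a stand-in model and loss (the repo's actual training loop differs):

import torch
from apex import amp  # Nvidia apex: https://github.com/NVIDIA/apex

model = torch.nn.Linear(80, 80).cuda()      # stand-in for the TTS model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# 'O1' patches selected ops to fp16 while keeping master weights in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

batch = torch.randn(16, 80, device='cuda')
loss = model(batch).pow(2).mean()           # stand-in loss
# Scale the loss before backward to avoid fp16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()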

Generate pattern

Command

python Pattern_Generate.py [parameters]

Parameters

At least one of the datasets must be used (see the example command after this list).

  • -lj
    • Set the path of LJSpeech. LJSpeech's patterns are generated.
  • -bc2013
    • Set the path of Blizzard Challenge 2013. Blizzard Challenge 2013's patterns are generated.
  • -cmua
    • Set the path of CMU arctic. CMU arctic's patterns are generated.
  • -vctk
    • Set the path of VCTK. VCTK's patterns are generated.
  • -libri
    • Set the path of LibriTTS. LibriTTS's patterns are generated.
  • -vc1
    • Set the path of VoxCeleb1. Glow-TTS does not support this because the VoxCeleb datasets have no text data.
  • -vc2
    • Set the path of VoxCeleb2. Glow-TTS does not support this because the VoxCeleb datasets have no text data.
  • -vc1t
    • Set the path of the VoxCeleb1 testset. Glow-TTS does not support this because the VoxCeleb datasets have no text data.
  • -text
    • Set whether to save the text information.
    • This is for other models. To use the patterns in Glow TTS, this option must be set.
  • -evalr
    • Set the evaluation pattern ratio.
    • Default is 0.001.
  • -evalm
    • Set the minimum number of evaluation patterns for each speaker.
    • Default is 1.
  • -mw
    • The number of worker threads used to create the patterns.
    • Default is 10.
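
For example, the following generates LJSpeech and VCTK patterns with text information saved; the dataset paths are placeholders:

python Pattern_Generate.py -lj /path/to/LJSpeech -vctk /path/to/VCTK -text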

Run

Command

python Train.py -s <int>
  • -s <int>
    • The resume step parameter.
    • Default is 0.
    • When this parameter is 0, the model tries to find the latest checkpoint in the checkpoint path (see the examples below).
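
For example, to resume from the latest checkpoint, or from a specific step (100000 here is illustrative and assumes a checkpoint was saved at that step):

python Train.py -s 0
python Train.py -s 100000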

Inference

  • Please check the example files for inference:
    • Inference_Example.ipynb
    • Inference.py

Result

Please see the demo site.

Trained checkpoint

Mode      Dataset    Trained steps  Link
Vanilla   LJ         100000         Link (broken)
SE & LUT  LJ + CMUA  100000         Link
SE & LUT  LJ + VCTK  100000         Link
PE        LJ + CMUA  100000         Link
PE        LJ + VCTK  400000         Link
GR & LUT  LJ + VCTK  400000         Link (failed)

(SE: speaker embedding, LUT: lookup table, PE: prosody encoding, GR: gradient reversal)

Future works

  • Training with GE2E speaker embedding
  • Gradient reversal model structure improvement
  • Training for additional steps