Glow_TTS

An implementation of the GlowTTS model. Several modes are added: speaker embedding, prosody encoder (GST), and gradient reversal.

Multispeaker GlowTTS

Requirements

  • torch >= 1.5.1

  • tensorboardX >= 2.0

  • librosa >= 0.7.2

  • matplotlib >= 3.1.3

  • Optional, for viewing the loss flow

    • tensorboard >= 2.2.2
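
The requirements above can be installed with pip; a minimal example, assuming a Python 3 environment (the version pins follow the list above):

pip install "torch>=1.5.1" "tensorboardX>=2.0" "librosa>=0.7.2" "matplotlib>=3.1.3" "tensorboard>=2.2.2"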

Structure

Training and inference structure diagrams are provided for each of the following modes:

  • Vanilla mode (Single speaker GlowTTS)
  • Speaker embedding mode
  • Prosody encoding mode (GST GlowTTS)
  • Gradient reversal mode (Voice cloning GlowTTS - failed)

Used dataset

  • The currently uploaded code is compatible with the following datasets.
  • An O mark in the Single or Multi column indicates that the dataset was actually used for the uploaded results in that mode.

Single  Multi  Dataset     Dataset address
O       O      LJSpeech    https://keithito.com/LJ-Speech-Dataset/
X       X      BC2013      http://www.cstr.ed.ac.uk/projects/blizzard/
X       O      CMU Arctic  http://www.festvox.org/cmu_arctic/index.html
X       O      VCTK        https://datashare.is.ed.ac.uk/handle/10283/2651
X       X      LibriTTS    https://openslr.org/60/

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameters.yaml' according to your environment.

  • Sound

    • Setting basic sound parameters.
    • Some parameters, like pitch, are not used in the current code. These are for future work.
  • Use_Cython_Alignment

    • Setting which implementation of monotonic alignment search to use (see the alignment sketch after this list).
    • If true, the cython implementation from the official code will be used.
    • If false, the python implementation will be used.
    • I recommend using the cython implementation because of its speed.
      • However, to use the cython implementation, you must compile it before running.
      • Please refer to the following: https://github.com/jaywalnut310/glow-tts#2-pre-requisites
  • Encoder

    • Setting the encoder parameters
  • Decoder

    • Setting the glow decoder parameters.
  • WaveNet

    • Setting the parameters of the vocoder.
    • This implementation uses a pre-trained Parallel WaveGAN model.
      • https://github.com/CODEJIN/PWGAN_Torch
    • If the checkpoint path is null, the model does not export wav files.
    • If the checkpoint path is not null, all parameters must match the pre-trained Parallel WaveGAN model.
  • Speaker_Embedding

    • Setting the speaker embedding generation method (see the embedding sketch after this list).
    • In Type, you can select null, 'LUT', or 'GE2E':
      • null: No speaker embedding. Single speaker version.
      • LUT: The model will generate a lookup table over the speakers.
      • GE2E: The model will use d-vectors generated by a pre-trained GE2E model.
  • Token path

    • Setting the path of the token-to-index dict.
    • The pattern generator creates this file.
  • Train

    • Setting the parameters of training.
  • Inference_Batch_Size

    • Setting the batch size used at inference.
    • If null, it will be the same as Train/Batch_Size.
  • Inference_Path

    • Setting the inference path
  • Checkpoint_Path

    • Setting the checkpoint path
  • Log_Path

    • Setting the tensorboard log path
  • Use_Mixed_Precision

    • Setting mixed precision.
    • To use it, Nvidia apex must be installed in the environment (see the mixed precision sketch after this list).
    • With some preprocessing hyper parameter settings, a loss overflow problem occurs.
  • Device

    • Setting which GPU device is used in a multi-GPU environment.
    • If using only the CPU, set '-1'.
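
For reference, the monotonic alignment search selected by Use_Cython_Alignment is a dynamic program over the token-frame log-likelihood matrix: every mel frame is assigned to exactly one token, and the token index never decreases along time. The following is a minimal pure-python sketch of that idea, not the repo's actual code; the function name and shapes are illustrative, and the cython module referenced above accelerates exactly this kind of nested loop.

import numpy as np

def monotonic_alignment_search(log_p):
    # log_p: [num_tokens, num_frames] log-likelihood of each mel frame
    # under each token's prior; assumes num_tokens <= num_frames.
    # Returns a hard 0/1 alignment of the same shape.
    num_tokens, num_frames = log_p.shape
    neg_inf = -1e9
    # Q[j, i]: best cumulative score of any monotonic path that ends
    # with frame i assigned to token j.
    Q = np.full((num_tokens, num_frames), neg_inf)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, num_frames):
        for j in range(min(i + 1, num_tokens)):
            stay = Q[j, i - 1]                               # same token
            advance = Q[j - 1, i - 1] if j > 0 else neg_inf  # next token
            Q[j, i] = log_p[j, i] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    alignment = np.zeros_like(log_p)
    j = num_tokens - 1
    for i in range(num_frames - 1, -1, -1):
        alignment[j, i] = 1.0
        if i > 0 and j > 0 and Q[j - 1, i - 1] >= Q[j, i - 1]:
            j -= 1
    return alignment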
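
The difference between the two Speaker_Embedding types is only where the per-speaker vector comes from; the rest of the model consumes it the same way. A minimal sketch with hypothetical names (the repo's actual modules and sizes differ):

import torch

class SpeakerConditioning(torch.nn.Module):
    # 'LUT': a trainable lookup table with one vector per training speaker.
    # 'GE2E': d-vectors produced by a frozen, pre-trained GE2E speaker
    # verification model, projected here to the embedding size.
    def __init__(self, num_speakers, embedding_size, d_vector_size=256):
        super().__init__()
        self.lut = torch.nn.Embedding(num_speakers, embedding_size)
        self.ge2e_projection = torch.nn.Linear(d_vector_size, embedding_size)

    def forward(self, speaker_ids=None, d_vectors=None):
        if speaker_ids is not None:
            return self.lut(speaker_ids)        # [batch, embedding_size]
        return self.ge2e_projection(d_vectors)  # [batch, embedding_size]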
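
For Use_Mixed_Precision, apex's amp wraps the model and optimizer and dynamically scales the loss so fp16 gradients do not underflow; overflowing steps are skipped by the scaler, which relates to the overflow note above. A minimal sketch with a stand-in model and loss (the repo's actual training loop differs):

import torch
from apex import amp  # Nvidia apex: https://github.com/NVIDIA/apex

model = torch.nn.Linear(80, 80).cuda()      # stand-in for the TTS model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# 'O1' patches selected ops to fp16 while keeping master weights in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

batch = torch.randn(16, 80, device='cuda')
loss = model(batch).pow(2).mean()           # stand-in loss
# Scale the loss before backward to avoid fp16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()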

Generate pattern

Command

python Pattern_Generate.py [parameters]

Parameters

At least one of the datasets must be used (see the example command after this list).

  • -lj
    • Set the path of LJSpeech. LJSpeech's patterns are generated.
  • -bc2013
    • Set the path of Blizzard Challenge 2013. Blizzard Challenge 2013's patterns are generated.
  • -cmua
    • Set the path of CMU arctic. CMU arctic's patterns are generated.
  • -vctk
    • Set the path of VCTK. VCTK's patterns are generated.
  • -libri
    • Set the path of LibriTTS. LibriTTS's patterns are generated.
  • -vc1
    • Set the path of VoxCeleb1. Glow-TTS does not support this because the VoxCeleb datasets have no text data.
  • -vc2
    • Set the path of VoxCeleb2. Glow-TTS does not support this because the VoxCeleb datasets have no text data.
  • -vc1t
    • Set the path of the VoxCeleb1 testset. Glow-TTS does not support this because the VoxCeleb datasets have no text data.
  • -text
    • Set whether to save the text information.
    • This is for other models. To use the patterns in Glow TTS, this option must be set.
  • -evalr
    • Set the evaluation pattern ratio.
    • Default is 0.001.
  • -evalm
    • Set the minimum number of evaluation patterns for each speaker.
    • Default is 1.
  • -mw
    • The number of worker threads used to create the patterns.
    • Default is 10.
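
For example, the following generates LJSpeech and VCTK patterns with text information saved; the dataset paths are placeholders:

python Pattern_Generate.py -lj /path/to/LJSpeech -vctk /path/to/VCTK -text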

Run

Command

python Train.py -s <int>
  • -s <int>
    • The resume step parameter.
    • Default is 0.
    • When this parameter is 0, the model tries to find the latest checkpoint in the checkpoint path (see the examples below).
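
For example, to resume from the latest checkpoint, or from a specific step (100000 here is illustrative and assumes a checkpoint was saved at that step):

python Train.py -s 0
python Train.py -s 100000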

Inference

  • Please check the example files for inference:
    • Inference_Example.ipynb
    • Inference.py

Result

Please see the demo site.

Trained checkpoint

Mode      Dataset    Trained steps  Link
Vanilla   LJ         100000         Link (broken)
SE & LUT  LJ + CMUA  100000         Link
SE & LUT  LJ + VCTK  100000         Link
PE        LJ + CMUA  100000         Link
PE        LJ + VCTK  400000         Link
GR & LUT  LJ + VCTK  400000         Link (failed)

(SE: speaker embedding, LUT: lookup table, PE: prosody encoding, GR: gradient reversal)

Future works

  • Training with GE2E speaker embedding
  • Gradient reversal model structure improvement
  • Training for additional steps