WavThruVec_pytorch
WavThruVec_pytorch copied to clipboard
An implementation of Charactr, Inc's "WavThruVec: Latent speech representation as intermediate features for neural speech synthesis"
WavThruVec Pytorch
An Unofficial Implementation of WavThruVec Based on Pytorch.
The original paper is WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
architecture
The Text2Vec model mostly follows the fastspeech (xcmyz's) architecture. I modified the model, mainly based on rad-tts (nvidia's). And I add an ECAPA_TDNN as speaker encoder, for multi-speaker condition.
For other details not mentioned in the paper, I also follow the rad-tts.
The Vec2Wav is mostly based on the hifi-gan, and introduce Conditional Batch Normalization to condition the network on the speaker embedding. The upsample rates sequence is (5,4,4,2,2) so the upsampling factor is $\times 320$ (original paper is $\times 640$), in other words, the generated WAVs have a sample rate of 16khz (32khz in original paper),.
text2vec training
text2vec inference
vec2wav
Input
- for text:
Do not use any rule-based text normalization or phonemization methods, but feed raw character and transform to text-embedding as inputs.
- for audio:
Use wav2vec 2.0's output as the wav's feature(instead of mel spectrogram), with a dtype of 'float32'
and a shape of (batch_size, n_frame, n_channel)
.
note: n_channel=768 or 1024, it depends on which version of the wav2vec 2.0 pretrained model you are using, because TencentGameMate provide fairseq-version(768) and huggingface-version(1024). These two version has different output shape.
wav2vec 2.0 pretrained
From this repository wav2vec2.0 (chinese speech pretrain), and it can also be found at huggingface
attn_prior
One of the biggest difference between WavThruVec and FastSpeech is the monotonic alignment search(MAS) module (refer to the alignment.py
).
In FastSpeech, the training inputs include Teacher-Forcing Alignment for mel frames and text tokens. Specifically, it involves using MFA to generate the duration
of mel frames for each text token before training.
While in WavThruVec, the duration
is generated using the MAS from the rad-tts, and is fed into the LengthRegulator(DurationPredictor).
According to monotonic alignment search and rad-tts implementation, when you training the model, align-prior files would be generated under './data/align_prior'
directory, with the file name format of {n_token}_{n_feat}_prior.pth
.
environment
- CUDA 10.1
- python 3.9.7
- torch 1.8.1+cu101
- torch-optimizer 0.3.0
- torchaudio 0.8.1
- tensorboard 2.12.0
- librosa 0.8.0
- numba 0.56.4
- numpy 1.22.4
- llvmlite 0.39.1
dataset and prepare
The prepare_data.py:
- 1.read the wav files and wav2vec2 pretrained model, resample the wavs to 16khz, and convert to .npy files, which contrain the corresponding wav2vec 2.0 feature.
- 2.read the aishell3 transcription(content.txt), and filter the Chinese phoneme and blank. Take the transcription and file path to build the train list(./data/enc_train.txt).
- 3.build the vocab, which will be used to convert the characters to torch Variable.
As an example, prepare_data.py only take a few speakers and a few wav files.
training
WavThruVec contrains 2 components: Text2Vec(encoder) and Vec2Wav(decoder), and they train independently
Thus, I placed them in two separate dirs and used different training configurations for each.
TensorBoard
The TensorBoard loggers are stored in the run/{log_seed}/tb_logs
directory.
Suppose log_seed=1
, you can use this command to serve the TensorBoard on your localhost.
tensorboard --logdir run/1/tb_logs
save checkpoint and restore
The model checkpoints are saved in the run/{log_seed}/model_new
directory.
Suppose you save checkpoints every 10000 iterations, and now you have a checkpoint checkpoint_10000.pth.tar
.
If you need to restart training at step 10000
, then use this command.
python ./text2vec/train.py --restore_step 10000
Todo
- experiment & Performace
- More details for implementation
Reference
Repository
- fastspeech (xcmyz's)
- wav2vec2.0 (chinese speech pretrain)
- rad-tts (nvidia's)
- gan-tts (yanggeng1995's)
- hifi-gan
- Fastpitch (dan-wells')
- ecapa_tdnn (Tao Ruijie's)
- ecapa_tdnn (lawlict's)
- glow-tts (jaywalnut310's)