vall-e
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Hello, I am working on Ubuntu 22 with an NVIDIA RTX 3080 and 64GB RAM. I followed the steps of the demo in README.md to train a model on LibriTTS....
### Add bilingual Mandarin and English support using BigCiDian
1. Add a phonemize backend *G2PBackend*
2. Import compiled [BigCiDian](https://github.com/speechio/BigCiDian) to g2p to support bilingual Mandarin and English
3. Add *userdict.txt*...
### 4.14
- https://github.com/lifeiteng/vall-e/pull/85 Refactored TextTokenizer
  - [code change](https://github.com/lifeiteng/vall-e/pull/85/files#diff-db0bfc2a9604102b98361aae3174bd5d2e7027e44bebf3d592e16a6f4d543581R152) and [test](https://github.com/lifeiteng/vall-e/pull/85/files#diff-91b6947dde6b1a2132060367c398eab274c2c45382591f46f5088eebe8fe733eR28)
  - before `two -> t u ː` after `two -> t uː`

### 4.xx
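The `two -> t u ː` / `two -> t uː` pair in the 4.14 entry suggests that the length mark `ː` now stays attached to the preceding phone instead of being emitted as a separate token. The following is a minimal sketch of that kind of merge, not the code from the PR; the function name `merge_length_marks` and the space-separated input are assumptions made for illustration.

```python
LENGTH_MARK = "ː"  # IPA length mark (U+02D0)


def merge_length_marks(phones):
    """Attach each length mark to the phone that precedes it.

    Hypothetical helper for illustration; not part of the repo.
    """
    merged = []
    for phone in phones:
        if phone == LENGTH_MARK and merged:
            merged[-1] += phone  # "u" followed by "ː" becomes "uː"
        else:
            merged.append(phone)
    return merged


print(merge_length_marks("t u ː".split()))
```

A length mark with nothing before it is left as its own token, since there is no phone to attach it to.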
There is a 'train-stage' option in trainer.py. In egs/libritts, there are two training procedures with different 'train-stage' options. Which one gives better synthesis results?
Inference results differ from run to run with the same config!
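One common cause of run-to-run differences is stochastic token sampling during decoding. Whether that explains this report is an assumption, since the issue gives no further detail. The sketch below uses only the standard library to show that top-k sampling repeats exactly once the random generator is seeded; `sample_top_k` and `decode` are hypothetical helpers, not functions from this repo.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept entries only (shifted for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def decode(logits, steps, seed):
    """Draw `steps` tokens with a private generator seeded by `seed`."""
    rng = random.Random(seed)
    return [sample_top_k(logits, k=3, rng=rng) for _ in range(steps)]


logits = [2.0, 1.5, 1.0, -1.0, -3.0]
# Two runs with the same seed produce the same token sequence.
print(decode(logits, steps=10, seed=0) == decode(logits, steps=10, seed=0))
```

In a PyTorch script the analogous step would be calling `torch.manual_seed` before inference, although some GPU kernels can still introduce small differences.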
I used cut_set.normalize_loudness because the loudness of the aishell audio files is low: https://github.com/lifeiteng/vall-e/blob/main/valle/bin/tokenizer.py#L173
```
if args.prefix == "aishell":
    # NOTE: the loudness of aishell audio files is around -33
    #...
```
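For intuition about what loudness normalization does to a waveform: moving from one decibel-scale level to another amounts to a constant gain. The sketch below assumes the -33 figure quoted in the snippet is a decibel-scale loudness; the target of -20 is a hypothetical value chosen for illustration, not the value used in `tokenizer.py`, and `gain_for_target` is not a repo function.

```python
def gain_for_target(current_db, target_db):
    """Linear amplitude factor that shifts loudness from current_db to target_db."""
    gain_db = target_db - current_db
    return 10 ** (gain_db / 20)  # a 20 dB change is a 10x amplitude change


current = -33.0  # approximate aishell loudness quoted in the snippet
target = -20.0   # hypothetical target, for illustration only
scale = gain_for_target(current, target)
print(f"gain: {target - current:+.1f} dB, amplitude factor: {scale:.2f}")
```

A real normalizer also has to measure the current loudness of each file first; this sketch covers only the gain arithmetic.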
Regarding the loss calculation in the AR model, why isn't the mask being handled?
```
total_loss = F.cross_entropy(logits, targets, reduction=reduction)
```
Normally, shouldn't it be:
```
total_loss = F.cross_entropy(logits.mask_selected(y_mask),...
```
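To make the question concrete, the sketch below computes a mean cross-entropy twice on a toy sequence: once over every position and once over only the positions marked valid. It is written without PyTorch so the arithmetic is visible, and it does not reproduce the repo's loss code; the names `cross_entropy`, `mean_loss`, and `valid` are hypothetical. Here `valid` is True for real tokens, which may be the opposite polarity of `y_mask` in the question.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of `logits` at index `target`."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def mean_loss(all_logits, targets, valid=None):
    """Mean cross-entropy; positions where valid[i] is False are skipped."""
    if valid is None:
        valid = [True] * len(targets)
    losses = [
        cross_entropy(row, tgt)
        for row, tgt, keep in zip(all_logits, targets, valid)
        if keep
    ]
    return sum(losses) / len(losses)


# Three positions over a 3-symbol vocabulary; the last position is padding.
all_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
valid = [True, True, False]

print("unmasked:", round(mean_loss(all_logits, targets), 4))
print("masked:  ", round(mean_loss(all_logits, targets, valid), 4))
```

In PyTorch, the `ignore_index` argument of `F.cross_entropy` gives a similar effect when padded targets are set to a sentinel value.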
When preparing the LibriTTS datasets, I run into this issue: `Scanning audio files (*.wav): 0it [00:00, ?it/s] Preparing LibriTTS parts: 71%|███▊ | 5/7 [00:00
Why should add_prenet be set to False? In my experiments, False is indeed better than True, but I do not understand why. Can you help...
When I use my own audio as the prompt for inference, I get this error: raise SyntaxError( SyntaxError: well trained model shouldn't reach here.