Training result
I'd like to inquire about my training results. I combined the AISHELL3 and aidata datasets with another Chinese dataset, totaling 600 hours of training data. Although the audio in these three datasets is not sampled at 24000 Hz, I set `cut_set = cut_set.resample(24000)` at line 184 of bin/tokenizer.py, so it should have been converted to 24000 Hz. I followed the documentation's instructions and trained with prefix-mode 1.
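For reference, a minimal sketch of that resampling step, assuming lhotse's `CutSet` API as used in bin/tokenizer.py (the manifest path here is hypothetical):

```python
# Minimal sketch of the resampling step, assuming lhotse's CutSet API
# as used around line 184 of bin/tokenizer.py.
from lhotse import CutSet

cut_set = CutSet.from_file("cuts_train.jsonl.gz")  # hypothetical manifest path
cut_set = cut_set.resample(24000)  # resample every recording to 24 kHz
```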
Train the AR model:

```bash
python3 bin/trainer.py --world-size 2 --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
  --num-buckets 6 --dtype "bfloat16" --save-every-n 10000 --valid-interval 20000 \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 \
  --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
  --exp-dir ${exp_dir}
```
Train the NAR model:

```bash
cp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3=2+1

python3 bin/trainer.py --world-size 2 --max-duration 40 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
  --num-buckets 6 --dtype "float32" --save-every-n 10000 --valid-interval 20000 \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 \
  --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \
  --exp-dir ${exp_dir}
```
However, when synthesizing speech from unseen data with the trained model, the following issues occur:

- The latter part of the audio prompt often appears at the beginning of the synthesized speech.
- Synthesizing long sentences leads to repeated or skipped segments in the latter part of the output.

Is there any way to improve these situations?
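One workaround I am considering for the long-sentence degradation is to split the input text into shorter sentences and synthesize them one at a time. A minimal sketch of that idea (the `synthesize` callable here is a hypothetical wrapper around the model's inference entry point, not an API of this repo):

```python
# Hedged sketch: split long input text into shorter chunks before synthesis.
# `synthesize` is a hypothetical callable wrapping the model's inference step;
# it is not part of this repository.
import re

def synthesize_in_chunks(text, synthesize, max_chars=80):
    # Split after Chinese or Western sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]
    chunks, buf = [], ""
    for s in sentences:
        if buf and len(buf) + len(s) > max_chars:
            chunks.append(buf)
            buf = s
        else:
            buf += s
    if buf:
        chunks.append(buf)
    # Synthesize each chunk independently and return the list of waveforms.
    return [synthesize(chunk) for chunk in chunks]
```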