
Training instructions from README.md not working for me

Open · davidmartinrius opened this issue 1 year ago · 4 comments

Hello,

I am working with Ubuntu 22, an NVIDIA RTX 3080 (10 GB), and 64 GB of RAM.

I followed the steps of the demo in the README.md to train a model on LibriTTS.


The result of the inference is wrong: it sounds like weird noise. I attached the wav inside a zip because GitHub does not allow uploading a wav directly.

0.zip

I ran the inference as in the instructions:

python3 bin/infer.py --output-dir infer/demos \
    --model-name valle --norm-first true --add-prenet false \
    --share-embedding true \
    --text-prompts "KNOT one point one five miles per hour." \
    --audio-prompts ./prompts/8463_294825_000043_000000.wav \
    --text "To get up and running quickly just follow the steps below." \
    --checkpoint=${exp_dir}/best-valid-loss.pt

Please, can you help me understand what I am doing wrong?

Ask me for any information you need to analyze this, and I will provide it.

When training, I had to change the --max-duration parameter to prevent an out-of-memory error.

For the AR model I changed --max-duration to 20; for the NAR model I changed --max-duration to 15.

In both cases I had to remove "--valid-interval 20000" because this parameter is not recognized by bin/trainer.py.

Thank you,

David Martin Rius

davidmartinrius · Apr 21 '23 13:04

@davidmartinrius The problem is --max-duration 20, which means the batch_size ends up in the range [1, 6].

Try to train the model on a 3090/4090 or an A100.

lifeiteng · Apr 22 '23 12:04

@lifeiteng thanks for your response, but it is not clear to me. Maybe the batch size is lower because of the VRAM, which means training needs more iterations, but it should not affect the final quality. I'm sorry, but your answer is not useful to me. It should be possible to train this on almost any NVIDIA RTX 3000-series GPU or newer...

Please, if you really think that the max duration is the problem, can you explain how to adapt it to a 10 GB GPU?
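
If I understand --max-duration correctly, it is a budget of total seconds of audio per batch (lhotse-style dynamic batching), so the effective batch size is roughly max-duration divided by the average utterance length. A rough back-of-the-envelope sketch in Python (the average durations below are just my guesses, not measured on LibriTTS):

def effective_batch_size(max_duration_s: float, avg_utterance_s: float) -> int:
    # Roughly how many utterances fit into one dynamically sized batch.
    return max(1, int(max_duration_s // avg_utterance_s))

# 20 and 15 are the values I actually used; 40 is a hypothetical larger budget.
for max_dur in (20, 15, 40):
    for avg_len in (3.0, 6.0, 10.0):
        print(f"max-duration={max_dur}s, avg utterance={avg_len}s "
              f"-> ~{effective_batch_size(max_dur, avg_len)} utterances per batch")

With a 20 s budget and 3-10 s utterances this gives roughly 2-6 utterances per batch, which matches the [1, 6] range you mentioned.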

davidmartinrius · Apr 22 '23 14:04

@davidmartinrius A small batch_size will not converge to a good local optimum. This is common knowledge in deep learning.

lifeiteng · Apr 23 '23 02:04

I agree with you on this point. I understand that when the batch size is too small, the gradients computed from each batch may not be representative of the overall structure of the dataset, leading to unstable and slow convergence during training.

That said, do you think it is possible to make it work by adjusting gradient accumulation, the learning rate, batch normalization, or even adding more layers? I don't know the whole project yet, so maybe you could evaluate it.
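
For example, by gradient accumulation I mean something like this generic PyTorch pattern (just a sketch of the idea to show what I mean, not code from this repo):

import torch
from torch import nn

model = nn.Linear(16, 4)                       # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
accum_steps = 8                                # effective batch = 8 x micro-batch

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(2, 16)                     # tiny micro-batch that fits in VRAM
    y = torch.randn(2, 4)
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches one big batch
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # update once per accum_steps micro-batches
        optimizer.zero_grad()

The idea is that the optimizer only steps after several small batches, so the gradient it uses is effectively computed over a larger batch, at the cost of more iterations.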

If there is a way to optimize it, I would like to try it. I know it means more training hours and more development.

Thank you!

David Martin Rius

davidmartinrius · Apr 23 '23 10:04