[Bug] The model does not train with the new hyperparameters given on the command line when restarting training with `restore_path` or `continue_path`
Describe the bug
I am trying to continue training a multi-speaker VITS model in Catalan on 4 x 16 GB V100 GPUs.
I want to modify some hyperparameters (such as the learning rate) to find the optimal configuration. When I launch a new training run with the --restore_path
argument plus the other hyperparameter arguments, a new config is created with the updated hyperparameters. However, during training the model does not use these new hyperparameters; it keeps the ones from the original model config.
In the "To Reproduce" section I attach the config of the original training, as well as the config, the logs and the command line used to run the new training.
Regarding the --continue_path
argument: when continuing training from the point where it stopped, the model resets the learning rate to the one in the original config.
Since the behavior is the same in both cases (the parameters of the original config are used and the new ones passed on the command line are ignored), I thought it appropriate to report both in the same issue.
To Reproduce
Original config: config.txt
New generated config: config.txt
logs of the new training: trainer_0_log.txt
The attached logs show current_lr_0: 0.00050 and current_lr_1: 0.00050:
--> STEP: 24/1620 -- GLOBAL_STEP: 170025
| > loss_disc: 2.35827 (2.46076)
| > loss_disc_real_0: 0.14623 (0.14530)
| > loss_disc_real_1: 0.23082 (0.20939)
| > loss_disc_real_2: 0.22020 (0.21913)
| > loss_disc_real_3: 0.19430 (0.22623)
| > loss_disc_real_4: 0.21045 (0.22390)
| > loss_disc_real_5: 0.20165 (0.23435)
| > loss_0: 2.35827 (2.46076)
| > grad_norm_0: 24.36758 (16.55595)
| > loss_gen: 2.37695 (2.37794)
| > loss_kl: 2.56117 (2.30560)
| > loss_feat: 9.57505 (8.38634)
| > loss_mel: 22.84378 (22.47223)
| > loss_duration: 1.59958 (1.55717)
| > loss_1: 38.95654 (37.09929)
| > grad_norm_1: 192.16046 (145.46979)
| > current_lr_0: 0.00050
| > current_lr_1: 0.00050
| > step_time: 0.96620 (1.22051)
| > loader_time: 0.00510 (0.00600)
Below I attach the command line used to launch the new training:
export RECIPE="${RUN_DIR}/recipes/multispeaker/vits/experiments/train_vits_ca.py"
export RESTORE="${RUN_DIR}/recipes/multispeaker/vits/experiments/checkpoint_vits_170000.pth"
python -m trainer.distribute --script ${RECIPE} -gpus "0,1,2,3" \
--restore_path ${RESTORE} --coqpit.lr_gen 0.0002 --coqpit.lr_disc 0.0002 \
--coqpit.eval_batch_size 8 --coqpit.epochs 4 --coqpit.batch_size 16
Expected behavior
No response
Logs
No response
Environment
{
"CUDA": {
"GPU": [
"Tesla V100-SXM2-16GB",
"Tesla V100-SXM2-16GB",
"Tesla V100-SXM2-16GB",
"Tesla V100-SXM2-16GB"
],
"available": true,
"version": "10.2"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.9.0a0+git3d70ab0",
"TTS": "0.6.2",
"numpy": "1.19.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "ppc64le",
"python": "3.7.4",
"version": "#1 SMP Tue Sep 25 12:28:39 EDT 2018"
}
}
Additional context
Trainer was updated to trainer==0.0.13. Please let me know if you need more information and thank you in advance.
Hi @GerrySant did you try running train-tts.py instead of a recipe as suggested by @erogol (see https://github.com/coqui-ai/TTS/discussions/1719#discussioncomment-3187998)? I believe I was facing the same issue and this trick solved it.
Good morning @Ca-ressemble-a-du-fake, I tried train_tts.py and now I get another error, this time related to my own data. I've been trying to find where the error comes from so I can give you an answer as soon as possible, but I'm still working on it. I suspect it has to do with the custom formatter for my data not being found. I attach the log of the new error. Thanks for the quick response. multispeaker_runner_gs_0002_8_16_restore_config_6283248.txt
Hi @GerrySant, the "name" parameter in the "datasets" section of your config file refers to the formatter name, not the voice name. So if you formatted your dataset with an LJSpeech-like structure, you should use "ljspeech" in "datasets/name". If you need a custom formatter, define it in "TTS/TTS/tts/dataset/formatters.py" and then use that name in your config file.
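For illustration, a minimal sketch of the equivalent when building the config in Python (assuming TTS ~0.6-0.8, where BaseDatasetConfig's name field selects the formatter; the paths and metadata file name are placeholders):
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# One entry per speaker dataset; "name" must match a formatter defined in
# formatters.py (e.g. "ljspeech" or "vctk_old"), not the speaker/voice name.
dataset_config = BaseDatasetConfig(
    name="ljspeech",                 # formatter name
    meta_file_train="metadata.csv",  # placeholder metadata file
    path="/path/to/speaker_01/",     # placeholder dataset root
)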
Hi @Ca-ressemble-a-du-fake, Thanks again for your quick response.
I have structured the data so that each speaker has its own metadata file and folder, so I am treating the data for each of my speakers as a separate dataset. However, I am going to try the option of adding a custom formatter in formatters.py. I will post an update as soon as I have new information.
You can pass a formatter to the 'load_tts_samples' function: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/datasets/init.py#L77 This should work just as well as adding the formatter to formatters.py.
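For what it's worth, a minimal sketch of that approach (assuming TTS ~0.8, where load_tts_samples accepts a formatter callable and samples are dicts; the speaker|wav_name|text metadata layout is only an example):
import os

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples

dataset_config = BaseDatasetConfig(
    name="custom",                   # the explicit formatter below is used for parsing
    meta_file_train="metadata.csv",  # placeholder metadata file
    path="/path/to/speaker_01/",     # placeholder dataset root
)

def my_formatter(root_path, meta_file, **kwargs):
    """Parse 'speaker|wav_name|text' lines into TTS sample dicts."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            speaker, wav_name, text = line.strip().split("|")
            items.append({
                "text": text,
                "audio_file": os.path.join(root_path, "wavs", wav_name + ".wav"),
                "speaker_name": speaker,
                "root_path": root_path,
            })
    return items

train_samples, eval_samples = load_tts_samples(
    dataset_config, eval_split=True, formatter=my_formatter
)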
Put your formatter anywhere in the train_tts.py script and make sure you pass it here: https://github.com/coqui-ai/TTS/blob/dev/TTS/bin/train_tts.py#L47 Let me know if this fixes the problem with the hyperparameters.
Hi @loganhart420, I am a colleague of @GerrySant. In the end we restructured our data in the vctk_old format and launched some runs using train_tts.py, and we still have the same problem, i.e. the stderr shows that the lr_gen and lr_disc used are not consistent with the values coming from coqpit. This time we tried it with both v0.6.2 and v0.8.0.
Although the results are the same for all of them (the initially shared configs and the two new ones), I am attaching the input and output configs plus the log for the process launched with TTS v0.8.0.
For the command:
export RUN_DIR=./TTS_v0.8.0
module purge
source $RUN_DIR/use_venv.sh
export RECIPE=${RUN_DIR}/TTS/bin/train_tts.py
export CONFIG=${RUN_DIR}/recipes/multispeaker/config_experiments/config_mixed.json
export RESTORE=${RUN_DIR}/../TTS/recipes/multispeaker/vits/config_experiments/best_model.pth
CUDA_VISIBLE_DEVICES="0" python ${RECIPE} --config_path ${CONFIG} --restore_path ${RESTORE} \
--coqpit.lr_disc 0.0001 --coqpit.lr_gen 0.0001 \
--coqpit.batch_size 32
Thanks for letting me know, I'll run the same setup and look into it.
Thanks @loganhart420. Just a quick note: we confirmed that other hyperparameters, specifically batch_size, can be changed as expected via restore_path. We did not check all the other hyperparameters, but it seems that lr_gen and lr_disc specifically are affected.
When you restore the model you also restore the scheduler, and it probably overrides what you define on the terminal? @loganhart420 can you check if that is the case?
Hey @loganhart420, have you discovered anything else? The behavior is the same if the changes are made directly in the config.
I haven't yet. I've been exploring this all morning; I'll figure out the solution.
Good morning @loganhart420, these days I have been working with this hard fix inside the restore_model() definition:
optimizer = restore_list_objs(checkpoint["optimizer"], optimizer)
--> optimizer = self.get_optimizer(self.model, self.config)
However, this change does not work with AMP, since AMP is initialized before restore_model() is called.
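To illustrate the behaviour behind this in plain PyTorch (a sketch of the idea only, not the actual trainer internals):
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the generator/discriminator

# Pretend this came from the checkpoint: an optimizer that was trained with lr=5e-4.
checkpoint_opt_state = torch.optim.AdamW(model.parameters(), lr=5e-4).state_dict()

new_lr = 2e-4  # the value passed via --coqpit.lr_gen / --coqpit.lr_disc

# Default restore path: load_state_dict() also restores param_groups,
# so the old learning rate silently wins over the new one.
optimizer = torch.optim.AdamW(model.parameters(), lr=new_lr)
optimizer.load_state_dict(checkpoint_opt_state)
print(optimizer.param_groups[0]["lr"])  # 0.0005, not 0.0002

# The hard fix above: rebuild the optimizer from the (new) config instead,
# at the cost of dropping the optimizer state (moments, step counts).
optimizer = torch.optim.AdamW(model.parameters(), lr=new_lr)
print(optimizer.param_groups[0]["lr"])  # 0.0002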
So this changes lr_gen and lr_disc to the ones defined in the terminal?
@GerrySant
I've been unable to reproduce this. I was able to change the learning rates for the pretrained VCTK model using the CLI. It might be something in your config file. I tried to train with everything matching your config except the file paths and I ran into other errors. I can try to debug those other errors and see what's up, but on my end lr_gen and lr_disc changed from 0.002 to 0.2 when using the CLI.
I'm going to go ahead and close this since I was unable to reproduce the bug using the same config. My suggestion is to uninstall TTS and try a fresh install.
Hi @loganhart420, sorry for the late reply; these days it has been impossible for me to respond earlier.
As @erogol commented, when the scheduler is restored, the related arguments defined on the terminal or in a new config are overwritten.
Therefore, what my change does is overwrite them again with the arguments present in the new config. But as I mentioned, this change does not run amp.initialize() for the new optimizer, so it can cause problems when using AMP.
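In case it helps, a hedged sketch of an alternative that keeps the restored (and AMP-initialized) optimizers and only forces the new learning rates onto them after the restore step; the function name, the argument order, and the config fields are assumptions, not the trainer's API:
def apply_new_lrs(optimizers, new_lrs):
    """Overwrite the learning rates of already-restored optimizers in place."""
    for optimizer, lr in zip(optimizers, new_lrs):
        for group in optimizer.param_groups:
            group["lr"] = lr

# Hypothetical usage right after the trainer's restore step, before training
# resumes; the [disc, gen] ordering is an assumption to verify against the
# model's get_optimizer() implementation:
# apply_new_lrs([opt_disc, opt_gen], [config.lr_disc, config.lr_gen])
# If an lr scheduler is restored as well, check whether its base_lrs need the
# same treatment.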