
[Bug] The model does not train with the new hyperparameters given on the command line when trying to restart a training with `restore_path` or `continue_path`

Open GerrySant opened this issue 2 years ago • 9 comments

Describe the bug

I am trying to continue the training of a multi-speaker VITS model in Catalan on four 16 GB V100 GPUs.

I want to try modifying some hyperparameters (like the learning rate) to find the best configuration. When launching the new trainings with the --restore_path argument and other hyperparameter arguments, a new config is created with the updated hyperparameters. However, during training the model does not use these new hyperparameters; it uses the same ones that appeared in the original model config.

In the "to reproduce" section I attach the config of the original training and the config, the logs and the command line used to run the new training.

Regarding the --continue_path argument, when continuing the training from the point where it stopped, the model resets the learning rate to that of the original config.

Since the behavior is the same in both cases (the parameters of the original config are used and the new ones passed on the command line are ignored), I thought it appropriate to report them in the same issue.

To Reproduce

Original config: config.txt

New generated config: config.txt
Logs of the new training: trainer_0_log.txt
The above logs show current_lr_0: 0.00050 and current_lr_1: 0.00050:

   --> STEP: 24/1620 -- GLOBAL_STEP: 170025
     | > loss_disc: 2.35827  (2.46076)
     | > loss_disc_real_0: 0.14623  (0.14530)
     | > loss_disc_real_1: 0.23082  (0.20939)
     | > loss_disc_real_2: 0.22020  (0.21913)
     | > loss_disc_real_3: 0.19430  (0.22623)
     | > loss_disc_real_4: 0.21045  (0.22390)
     | > loss_disc_real_5: 0.20165  (0.23435)
     | > loss_0: 2.35827  (2.46076)
     | > grad_norm_0: 24.36758  (16.55595)
     | > loss_gen: 2.37695  (2.37794)
     | > loss_kl: 2.56117  (2.30560)
     | > loss_feat: 9.57505  (8.38634)
     | > loss_mel: 22.84378  (22.47223)
     | > loss_duration: 1.59958  (1.55717)
     | > loss_1: 38.95654  (37.09929)
     | > grad_norm_1: 192.16046  (145.46979)
     | > current_lr_0: 0.00050 
     | > current_lr_1: 0.00050 
     | > step_time: 0.96620  (1.22051)
     | > loader_time: 0.00510  (0.00600)

Below I attach the command line used to launch the new training:

export RECIPE="${RUN_DIR}/recipes/multispeaker/vits/experiments/train_vits_ca.py"
export RESTORE="${RUN_DIR}/recipes/multispeaker/vits/experiments/checkpoint_vits_170000.pth"

python -m trainer.distribute --script ${RECIPE} -gpus "0,1,2,3" \
--restore_path ${RESTORE} --coqpit.lr_gen 0.0002 --coqpit.lr_disc 0.0002 \
--coqpit.eval_batch_size 8 --coqpit.epochs 4 --coqpit.batch_size 16

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.9.0a0+git3d70ab0",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "ppc64le",
        "python": "3.7.4",
        "version": "#1 SMP Tue Sep 25 12:28:39 EDT 2018"
    }
}

Additional context

Trainer was updated to trainer==0.0.13. Please let me know if you need more information and thank you in advance.

GerrySant avatar Sep 12 '22 11:09 GerrySant

Hi @GerrySant did you try running train_tts.py instead of a recipe, as suggested by @erogol (see https://github.com/coqui-ai/TTS/discussions/1719#discussioncomment-3187998)? I believe I was facing the same issue and this trick solved it.

Ca-ressemble-a-du-fake avatar Sep 13 '22 20:09 Ca-ressemble-a-du-fake

Good morning @Ca-ressemble-a-du-fake, I tried using train_tts.py and now I have another error related to my own data. I've been trying to see where the error comes from so I can give you an answer as soon as possible, but I'm still working on it. I suspect it has to do with not finding the custom formatter for my data. I attach the log of the new error. Thanks for the quick response. multispeaker_runner_gs_0002_8_16_restore_config_6283248.txt

GerrySant avatar Sep 19 '22 07:09 GerrySant

Hi @GerrySant, the "name" parameter in the "datasets" section of your config file refers to the formatter name, not the voice name. So if you formatted your dataset with an LJSpeech-like structure, you should use "ljspeech" in "datasets/name". If you need a custom formatter, define it in "TTS/TTS/tts/datasets/formatters.py" and then use that name in your config file.
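
For reference, a formatter in formatters.py is just a function that turns a metadata file into a list of sample dictionaries. Below is a minimal sketch for a one-folder-per-speaker layout; the function name, the "wav_id|transcript" line format, and the wavs/ subfolder are assumptions, and the returned keys mirror the built-in ljspeech formatter in recent releases, so check the formatters.py of your installed version:

    import os

    def my_multispeaker(root_path, meta_file, **kwargs):  # hypothetical name
        """One metadata file per speaker folder; adapt the parsing to your data."""
        items = []
        with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
            for line in f:
                wav_id, text = line.strip().split("|", maxsplit=1)
                wav_file = os.path.join(root_path, "wavs", f"{wav_id}.wav")
                items.append({
                    "text": text,
                    "audio_file": wav_file,
                    "speaker_name": os.path.basename(root_path),  # folder name as speaker id
                    "root_path": root_path,  # some older TTS versions omit this key
                })
        return items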

Ca-ressemble-a-du-fake avatar Sep 20 '22 04:09 Ca-ressemble-a-du-fake

Hi @Ca-ressemble-a-du-fake, Thanks again for your quick response.

I have structured the data so that each speaker has its own metadata file and folder. Therefore, I am treating the data for each of my speakers as if they were separate datasets. However, I am going to try adding a custom formatter in formatters.py. As soon as I have new information I will post an update.

GerrySant avatar Sep 20 '22 07:09 GerrySant

Hi @Ca-ressemble-a-du-fake, Thanks again for your quick response.

I have structured the data so that each speaker has its own metadata file and folder. Therefore, I am treating the data for each of my speakers as if they were separate datasets. However, I am going to try adding a custom formatter in formatters.py. As soon as I have new information I will post an update.

you can pass a formatter in the 'load_tts_samples' function: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/datasets/__init__.py#L77 this should work just as well as adding the formatter to formatters.py
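
For example, in TTS/bin/train_tts.py the call could look roughly like this (a sketch; `my_multispeaker` is the hypothetical custom formatter from the sketch above, defined or imported somewhere in the script):

    from TTS.tts.datasets import load_tts_samples

    # Pass the formatter callable explicitly instead of relying on the
    # "name" lookup in formatters.py.
    train_samples, eval_samples = load_tts_samples(
        config.datasets,
        eval_split=True,
        formatter=my_multispeaker,
    )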

loganhart02 avatar Sep 21 '22 11:09 loganhart02

Good morning @Ca-ressemble-a-du-fake, I tried using train_tts.py and now I have another error related to my own data. I've been trying to see where the error comes from so I can give you an answer as soon as possible, but I'm still working on it. I suspect it has to do with not finding the custom formatter for my data. I attach the log of the new error. Thanks for the quick response. multispeaker_runner_gs_0002_8_16_restore_config_6283248.txt

put your formatter anywhere in the `train_tts.py` script and make sure you pass it here: https://github.com/coqui-ai/TTS/blob/dev/TTS/bin/train_tts.py#L47 let me know if this fixes the problem with the hyperparameters

loganhart02 avatar Sep 21 '22 11:09 loganhart02

Hi @loganhart420 I am the colleague of @GerrySant. In the end we restructured our data into the vctk_old format and launched some processes using train_tts.py, and we still have the same problem, i.e. the stderr shows that the lr_gen and lr_disc being used are not consistent with the values coming from coqpit. This time we tried it with both v0.6.2 and v0.8.0.

Although the results are the same for all of them (the initially shared configs and the two new ones), I am attaching the input and output configs plus the log for the process launched using TTS v0.8.0.

For the command:

export RUN_DIR=./TTS_v0.8.0
module purge
source $RUN_DIR/use_venv.sh

export RECIPE=${RUN_DIR}/TTS/bin/train_tts.py
export CONFIG=${RUN_DIR}/recipes/multispeaker/config_experiments/config_mixed.json
export RESTORE=${RUN_DIR}/../TTS/recipes/multispeaker/vits/config_experiments/best_model.pth

CUDA_VISIBLE_DEVICES="0" python ${RECIPE} --config_path ${CONFIG} --restore_path ${RESTORE} \
                                          --coqpit.lr_disc 0.0001 --coqpit.lr_gen 0.0001 \
                                          --coqpit.batch_size 32

files: trainer_0_log.txt config_input.txt config_output.txt

gullabi avatar Sep 21 '22 14:09 gullabi

Hi @loganhart420 I am the colleague of @GerrySant. In the end we restructured our data into the vctk_old format and launched some processes using train_tts.py, and we still have the same problem, i.e. the stderr shows that the lr_gen and lr_disc being used are not consistent with the values coming from coqpit. This time we tried it with both v0.6.2 and v0.8.0.

Although the results are the same for all of them (the initially shared configs and the two new ones), I am attaching the input and output configs plus the log for the process launched using TTS v0.8.0.

For the command:

export RUN_DIR=./TTS_v0.8.0
module purge
source $RUN_DIR/use_venv.sh

export RECIPE=${RUN_DIR}/TTS/bin/train_tts.py
export CONFIG=${RUN_DIR}/recipes/multispeaker/config_experiments/config_mixed.json
export RESTORE=${RUN_DIR}/../TTS/recipes/multispeaker/vits/config_experiments/best_model.pth

CUDA_VISIBLE_DEVICES="0" python ${RECIPE} --config_path ${CONFIG} --restore_path ${RESTORE} \
                                          --coqpit.lr_disc 0.0001 --coqpit.lr_gen 0.0001 \
                                          --coqpit.batch_size 32

files: trainer_0_log.txt config_input.txt config_output.txt

Thanks for letting me know, I'll run the same setup and look into it.

loganhart02 avatar Sep 21 '22 14:09 loganhart02

Thanks @loganhart420. Just a quick note: we confirmed that other hyperparameters, specifically batch_size, can be changed as expected via restore_path. We did not check all the other hyperparameters, but it seems that lr_gen and lr_disc specifically are affected.

gullabi avatar Sep 22 '22 10:09 gullabi

When you restore the model you also restore the scheduler, and it probably overrides what you define on the terminal. @loganhart420 can you check if that is the case?
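
That mechanism is easy to reproduce with plain PyTorch (a toy sketch, not Trainer code): loading an optimizer state dict replaces the param-group hyperparameters, including the learning rate.

    import torch

    model = torch.nn.Linear(4, 4)
    opt_old = torch.optim.AdamW(model.parameters(), lr=5e-4)  # lr of the original run
    ckpt = opt_old.state_dict()                               # what ends up in checkpoint["optimizer"]

    opt_new = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr passed on the CLI
    opt_new.load_state_dict(ckpt)                             # restoring, as restore_model() does
    print(opt_new.param_groups[0]["lr"])                      # 0.0005, the CLI value is gone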

erogol avatar Sep 26 '22 09:09 erogol

Hey @loganhart420, have you discovered anything else? The behavior is the same if the changes are made directly in the config.

GerrySant avatar Oct 13 '22 07:10 GerrySant

Hey @loganhart420, have you discovered anything else? The behavior is the same if the changes are made directly in the config.

I haven't yet. I'm exploring this all morning; I'll find the solution.

loganhart02 avatar Nov 11 '22 14:11 loganhart02

Good morning @loganhart420, these days I have been working with this hard fix inside the restore_model() definition:

optimizer = restore_list_objs(checkpoint["optimizer"], optimizer) --> optimizer = self.get_optimizer(self.model, self.config)

However, this change does not work with AMP, since AMP is initialised before restore_model() is called.
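
For clarity, the change described above shown in context (a sketch against trainer==0.0.13; the exact surrounding code in restore_model() may differ):

    # original: restores the saved optimizer state, which brings back the
    # learning rates stored in the checkpoint
    # optimizer = restore_list_objs(checkpoint["optimizer"], optimizer)

    # hard fix: rebuild the optimizer(s) from the (new) config instead,
    # discarding the optimizer state saved in the checkpoint
    optimizer = self.get_optimizer(self.model, self.config)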

GerrySant avatar Nov 11 '22 17:11 GerrySant

Good morning @loganhart420, these days I have been working with this hard fix inside the restore_model() definition:

optimizer = restore_list_objs(checkpoint["optimizer"], optimizer) --> optimizer = self.get_optimizer(self.model, self.config)

However, this change does not work with AMP, since AMP is initialised before restore_model() is called.

So this changes lr_gen and lr_disc to the ones defined on the terminal?

loganhart02 avatar Nov 11 '22 19:11 loganhart02

@GerrySant

I've been unable to reproduce this. I was able to change the learning rates for the pretrained VCTK model using the CLI, so it might be something in your config file. I tried to train matching everything in your config except the file paths and I was getting other errors. I can try to debug those other errors and see what's up, but on my end lr_gen and lr_disc changed from 0.002 to 0.2 when using the CLI.

loganhart02 avatar Nov 12 '22 05:11 loganhart02

@GerrySant

I've been unable to reproduce this. I was able to change the learning rates for the pretrained VCTK model using the CLI, so it might be something in your config file. I tried to train matching everything in your config except the file paths and I was getting other errors. I can try to debug those other errors and see what's up, but on my end lr_gen and lr_disc changed from 0.002 to 0.2 when using the CLI.

I'm going to go ahead and close this since I was unable to reproduce the bug using the same config. My suggestion is to uninstall TTS and try a fresh install.

loganhart02 avatar Nov 12 '22 16:11 loganhart02

Good morning @loganhart420, these days I have been working with this hard fix inside the restore_model() definition: optimizer = restore_list_objs(checkpoint["optimizer"], optimizer) --> optimizer = self.get_optimizer(self.model, self.config) However, this change does not work with AMP, since AMP is initialised before restore_model() is called.

So this changes lr_gen and lr_disc to the ones defined on the terminal?

Hi @loganhart420, sorry for the late reply; these days it has been impossible for me to respond earlier.

As @erogol commented, when the scheduler is restored, the related arguments defined on the terminal or indicated in a new config are overwritten.

Therefore, my change overwrites them again with the arguments present in the new config. But as I mentioned, this change does not run amp.initialize() for the new optimizer, so it can cause problems when AMP is used.
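
A possible alternative that keeps the restored optimizer and AMP state intact would be to overwrite only the learning rates after restore_model() returns. A rough, untested sketch follows; the 0 = discriminator / 1 = generator ordering is an assumption (check the list returned by Vits.get_optimizer() in your TTS version), and a restored lr scheduler may still adjust these values on its next step:

    # Push the learning rates from the new config back into the restored
    # optimizers' param groups instead of rebuilding the optimizers.
    for group in optimizer[0].param_groups:
        group["lr"] = config.lr_disc
    for group in optimizer[1].param_groups:
        group["lr"] = config.lr_gen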

GerrySant avatar Nov 15 '22 08:11 GerrySant