amount of data for single speaker
Hi, I am trying to develop the model for a single speaker, and I have roughly 10-12 minutes of data. Would this be enough to get decent, passable results? (I plan on using a pretrained model, btw.)
Should I use flowtron_ljs or flowtron_libritts2k (since this is few-shot)?
Also, a request: if at all possible, could you provide a Colab notebook for training?
yes, it is possible to get decent results with the amount of data you have. the closer your speaker is to existing speakers in flowtron_libritts2k the better it will sound.
use flowtron_libritts2k, change config.json to work with your data, and call
python train.py -c config.json -p train_config.finetune_layers=["speaker_embedding.weight"] train_config.checkpoint_path="models/flowtron_libritts2k.pt"
you'll need to create a filelist (https://github.com/NVIDIA/flowtron/tree/master/filelists) for your data. you can set your speaker to any in libritts.
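
A minimal sketch of building such a filelist, assuming the pipe-separated `audio_path|text|speaker_id` layout used by the files in the linked filelists directory. The helper name `filelist_lines`, the example paths, and the speaker id 40 are hypothetical; compare the output against one of the repository's own filelists before training.

```python
def filelist_lines(transcripts, speaker_id):
    """Return one 'audio_path|text|speaker_id' line per utterance.

    transcripts: dict mapping an audio path to its transcript (hypothetical input).
    speaker_id:  id of an existing LibriTTS speaker to reuse for the new voice.
    """
    lines = []
    for audio_path, text in transcripts.items():
        # '|' separates the fields, so it must not appear inside the transcript.
        clean_text = " ".join(text.replace("|", " ").split())
        lines.append(f"{audio_path}|{clean_text}|{speaker_id}")
    return lines


# Hypothetical example data: two clips from the new speaker.
example = {
    "data/my_speaker/0001.wav": "Hello there.",
    "data/my_speaker/0002.wav": "This is a second   sentence.",
}
lines = filelist_lines(example, speaker_id=40)
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a text file under `filelists/` and pointing `training_files` at it in config.json would complete the step.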
Thank you for the reply, I will train the model according to your recommendations and share my results soon!
Hi @rafaelvalle, I'm trying to finetune on a small dataset with the flowtron_libritts2k3k.pt model; however, I'm running into this error:

    if len(ignore_layers) > 0:
        model_dict = {k: v for k, v in model_dict.items()
                      if k not in ignore_layers}
        dummy_dict = model.state_dict()
        dummy_dict.update(model_dict)
        model_dict = dummy_dict
    else:
        optimizer.load_state_dict(checkpoint_dict['optimizer'])

    File "train.py", line 125, in load_checkpoint
        optimizer.load_state_dict(checkpoint_dict['optimizer'])
    KeyError: 'optimizer'

It seems like there is no key for optimizer in the saved model. What's the right way to go about fixing this?
This happens if you use flowtron_libritts2k3k.pt as config.checkpoint_path. Using the pretrained model to warmstart (config.warmstart_checkpoint_path) instead should solve it.
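
A small sketch of that change, expressed as an edit to the loaded config.json. The key names `checkpoint_path` and `warmstart_checkpoint_path` come from the configs posted in this thread; the helper `use_warmstart` is hypothetical and only illustrates moving the pretrained path from one key to the other.

```python
import copy


def use_warmstart(config, pretrained_path):
    """Return a copy of config that warm-starts from pretrained_path.

    checkpoint_path is cleared so that training does not try to resume
    (and therefore does not look up an 'optimizer' entry in the file).
    """
    new_config = copy.deepcopy(config)
    train = new_config["train_config"]
    train["checkpoint_path"] = ""
    train["warmstart_checkpoint_path"] = pretrained_path
    return new_config


# Config fragment reproducing the failing setup from the question.
config = {
    "train_config": {
        "checkpoint_path": "models/flowtron_libritts2k3k.pt",
        "warmstart_checkpoint_path": "",
    }
}
fixed = use_warmstart(config, config["train_config"]["checkpoint_path"])
print(fixed["train_config"])
```

The same effect can be had on the command line by passing `train_config.warmstart_checkpoint_path=...` through `-p` and leaving `checkpoint_path` empty.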
Great, thanks! @stqc
So after about 102k iterations, the generated audio sounds exactly like the speaker, but the spoken words are not coherent at all, and the attention_weight plot has a weird shape (training with n_flows = 1).
The following is the config.json:

    {
        "train_config": {
            "output_directory": "H:/fs",
            "epochs": 10000000,
            "optim_algo": "RAdam",
            "learning_rate": 1e-4,
            "weight_decay": 1e-6,
            "grad_clip_val": 1,
            "sigma": 1.0,
            "iters_per_checkpoint": 1000,
            "batch_size": 2,
            "seed": 1234,
            "checkpoint_path": "",
            "ignore_layers": [],
            "finetune_layers": [],
            "include_layers": ["encoder", "embedding"],
            "warmstart_checkpoint_path": "",
            "with_tensorboard": true,
            "fp16_run": true
        },
        "data_config": {
            "training_files": "filelists/jennidata1.txt",
            "validation_files": "filelists/val.txt",
            "text_cleaners": ["flowtron_cleaners"],
            "p_arpabet": 0.0,
            "cmudict_path": "data/cmudict_dictionary",
            "sampling_rate": 22050,
            "filter_length": 1024,
            "hop_length": 256,
            "win_length": 1024,
            "mel_fmin": 0.0,
            "mel_fmax": 8000.0,
            "max_wav_value": 32768.0,
            "use_attn_prior": true,
            "attn_prior_threshold": 1e-4,
            "keep_ambiguous": false
        },
        "dist_config": {
            "dist_backend": "nccl",
            "dist_url": "tcp://localhost:54321"
        },
        "model_config": {
            "n_speakers": 1,
            "n_speaker_dim": 128,
            "n_text": 185,
            "n_text_dim": 512,
            "n_flows": 1,
            "n_mel_channels": 80,
            "n_attn_channels": 640,
            "n_hidden": 1024,
            "n_lstm_layers": 2,
            "mel_encoder_n_hidden": 512,
            "n_components": 0,
            "mean_scale": 0.0,
            "fixed_gaussian": true,
            "dummy_speaker_embedding": false,
            "use_gate_layer": true,
            "use_cumm_attention": false
        }
    }

Hey @stqc , do you follow the 2-step training method? i.e. training with attention prior, then training without.
I'm quite new to this and am trying to train on about 20 minutes of speaker data.
I have set the config.json like this:

    {
        "train_config": {
            "output_directory": "outdir",
            "epochs": 10000000,
            "optim_algo": "RAdam",
            "learning_rate": 1e-3,
            "weight_decay": 1e-6,
            "grad_clip_val": 1,
            "sigma": 1.0,
            "iters_per_checkpoint": 1000,
            "batch_size": 8,
            "seed": 1234,
            "checkpoint_path": "",
            "ignore_layers": [],
            "finetune_layers": [],
            "include_layers": ["speaker", "encoder", "embedding"],
            "warmstart_checkpoint_path": "",
            "with_tensorboard": true,
            "fp16_run": true
        },
        "data_config": {
            "training_files": "filelists/pen_train.txt",
            "validation_files": "filelists/pen_val.txt",
            "text_cleaners": ["flowtron_cleaners"],
            "p_arpabet": 0.5,
            "cmudict_path": "data/cmudict_dictionary",
            "sampling_rate": 22050,
            "filter_length": 1024,
            "hop_length": 256,
            "win_length": 1024,
            "mel_fmin": 0.0,
            "mel_fmax": 8000.0,
            "max_wav_value": 32768.0,
            "use_attn_prior": true,
            "attn_prior_threshold": 1e-4,
            "keep_ambiguous": false
        },
        "dist_config": {
            "dist_backend": "nccl",
            "dist_url": "tcp://localhost:54321"
        },
        "model_config": {
            "n_speakers": 2311,
            "n_speaker_dim": 128,
            "n_text": 185,
            "n_text_dim": 512,
            "n_flows": 2,
            "n_mel_channels": 80,
            "n_attn_channels": 640,
            "n_hidden": 1024,
            "n_lstm_layers": 2,
            "mel_encoder_n_hidden": 512,
            "n_components": 0,
            "mean_scale": 0.0,
            "fixed_gaussian": true,
            "dummy_speaker_embedding": false,
            "use_gate_layer": true,
            "use_cumm_attention": false
        }
    }

And executing the following command:
python train.py -c config.json -p train_config.finetune_layers=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path="models/flowtron_libritts2p3k.pt"
and for inference:
python inference.py -c config.json -f outdir/model_9000 -w models/waveglow_256channels_universal_v5.pt -t "It is well know that deep generative models have a deep latent space!" -i 40
Note: I have speaker ID set as 40 in my file_list.
I don't know how to go about the 2-stage training. I just let it train once and then call inference. However, my results are really bad. I would really appreciate it if you could guide me a bit on training and tell me whether my config.json looks fine. Also, how many steps should I train before stopping? Judging from the validation and train loss, training looks fine, as both are going down. I don't know how to interpret the attention plots, though.
What are the GPU memory requirements to run this model? It doesn't look like I can run it on an 8GB 2070 Super even if I reduce the batch size to 1. Are there any other ways to squeeze into this memory?
See my answer to #119
@shehrum you need to first train with the attention prior enabled and then disable it and resume training once the attention looks good.
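
A rough sketch of that two-step schedule as a config transformation. It assumes that resuming goes through `train_config.checkpoint_path` pointing at a checkpoint written by train.py during the first stage; the helper `stage_two_config` is hypothetical, and the checkpoint name `outdir/model_9000` is only borrowed from the inference command earlier in the thread.

```python
import copy


def stage_two_config(stage_one_config, stage_one_checkpoint):
    """Derive the second-stage config from the first-stage one.

    Stage one trains with the attention prior enabled. Once the attention
    plots look good, stage two disables the prior and resumes training.
    """
    cfg = copy.deepcopy(stage_one_config)
    cfg["data_config"]["use_attn_prior"] = False
    # Assumption: resume from the stage-one checkpoint instead of warm-starting.
    cfg["train_config"]["checkpoint_path"] = stage_one_checkpoint
    cfg["train_config"]["warmstart_checkpoint_path"] = ""
    return cfg


# Fragment of a stage-one config: prior enabled, warm-started from a pretrained model.
stage_one = {
    "train_config": {
        "checkpoint_path": "",
        "warmstart_checkpoint_path": "models/flowtron_libritts2p3k.pt",
    },
    "data_config": {"use_attn_prior": True},
}
stage_two = stage_two_config(stage_one, "outdir/model_9000")
print(stage_two)
```

When to switch is a judgment call based on the attention plots; the sketch only covers which settings change between the two runs.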
do you mean this setting?
attention prior enabled
"use_attn_prior": **true**,
"attn_prior_threshold": 0.0,
"prior_cache_path": "/attention_prior_cache",
then disable it
"use_attn_prior": **false**,
"attn_prior_threshold": 0.0,
"prior_cache_path": "/attention_prior_cache",
thx for info.
Actually, I tried fine-tuning for few-shot speech synthesis. sid0_sigma0.5_attnlayer0 has a clear attn map, while sid0_sigma0.5_attnlayer1 failed to form a clear attn map.