Amount of data for a single speaker
Hi, I am trying to develop a model for a single speaker, and I happen to have roughly 10-12 minutes of data. Would this be enough to get decent, passable results? (I plan on using a pretrained model, by the way.)
Should I use flowtron_ljs or flowtron_libritts2k (since this is few-shot)?
Also, a request, if at all possible: could you provide a Colab notebook for training?
Yes, it is possible to get decent results with the amount of data you have. The closer your speaker is to existing speakers in flowtron_libritts2k, the better it will sound.
Use flowtron_libritts2k, change config.json to work with your data, and call:

```
python train.py -c config.json -p train_config.finetune_layers=["speaker_embedding.weight"] train_config.checkpoint_path="models/flowtron_libritts2k.pt"
```

You'll need to create a filelist (https://github.com/NVIDIA/flowtron/tree/master/filelists) for your data. You can set your speaker to any in LibriTTS.
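For reference, each filelist line is the audio path, the transcript, and an integer speaker ID, separated by pipes; a minimal sketch with hypothetical paths and speaker ID:

```
wavs/myspeaker_0001.wav|Hello there, this is a sample sentence.|40
wavs/myspeaker_0002.wav|Another short training utterance.|40
```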
Thank you for the reply, I will train the model according to your recommendations and share my results soon!
Hi @rafaelvalle, I'm trying to fine-tune on a small dataset with the flowtron_libritts2p3k.pt model; however, I'm running into this error:
```python
if len(ignore_layers) > 0:
    model_dict = {k: v for k, v in model_dict.items()
                  if k not in ignore_layers}
    dummy_dict = model.state_dict()
    dummy_dict.update(model_dict)
    model_dict = dummy_dict
else:
    optimizer.load_state_dict(checkpoint_dict['optimizer'])
```

```
  File "train.py", line 125, in load_checkpoint
    optimizer.load_state_dict(checkpoint_dict['optimizer'])
KeyError: 'optimizer'
```
It seems like there is no key for optimizer in the saved model. What's the right way to go about fixing this?
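A guard like the following (a sketch against the excerpt above) would avoid the crash, though it may not be the intended fix:

```python
# Sketch: only restore optimizer state when the checkpoint actually carries it;
# released pretrained checkpoints may store model weights only.
if 'optimizer' in checkpoint_dict:
    optimizer.load_state_dict(checkpoint_dict['optimizer'])
```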
This happens if you use flowtron_libritts2p3k.pt as config.checkpoint_path. Using the pretrained model to warmstart (config.warmstart_checkpoint_path) instead should solve it.
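In config.json terms, the fix looks something like this (a sketch; paths as used elsewhere in this thread):

```json
"train_config": {
    "checkpoint_path": "",
    "warmstart_checkpoint_path": "models/flowtron_libritts2p3k.pt"
}
```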
Great, thanks! @stqc
So after about 102k iterations, the generated audio sounds exactly like the speaker, but the spoken words are not coherent at all, and there is also a weird shape to the attention weights (training with n_flows = 1).
The following is my config.json:
{ "train_config": { "output_directory": "H:/fs", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-4, "weight_decay": 1e-6, "grad_clip_val": 1, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 2, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "finetune_layers": [], "include_layers": ["encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": true }, "data_config": { "training_files": "filelists/jennidata1.txt", "validation_files": "filelists/val.txt", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.0, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0, "use_attn_prior": true, "attn_prior_threshold": 1e-4, "keep_ambiguous": false }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 1, "n_speaker_dim": 128, "n_text": 185, "n_text_dim": 512, "n_flows": 1, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true, "use_cumm_attention": false } }
Hey @stqc, do you follow the 2-step training method, i.e., training with the attention prior first, then training without it?
I'm quite a newbie to this, and I'm trying to train on about 20 minutes of speaker data.
I have set config.json like this:
{ "train_config": { "output_directory": "outdir", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-3, "weight_decay": 1e-6, "grad_clip_val": 1, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 8, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "finetune_layers": [], "include_layers": ["speaker", "encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": true }, "data_config": { "training_files": "filelists/pen_train.txt", "validation_files": "filelists/pen_val.txt", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0, "use_attn_prior": true, "attn_prior_threshold": 1e-4, "keep_ambiguous": false }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 2311, "n_speaker_dim": 128, "n_text": 185, "n_text_dim": 512, "n_flows": 2, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true, "use_cumm_attention": false } }
And executing the following command:

```
python train.py -c config.json -p train_config.finetune_layers=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path="models/flowtron_libritts2p3k.pt"
```
And for inference:

```
python inference.py -c config.json -f outdir/model_9000 -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a deep latent space!" -i 40
```
Note: I have the speaker ID set to 40 in my filelist.
I don't know how to go about the 2-stage training; I just let it train once and then called inference. However, my results are really bad. I would really appreciate it if you could guide me a bit about training and tell me whether my config.json looks fine. Also, how many steps should I train for before stopping? From the validation and training loss it looks fine, as both are going down. I don't know how to interpret the attention plots, though.
What are the GPU memory requirements to run this model? It doesn't look like I can run it on an 8GB 2070 Super even if I reduce the batch size to 1. Are there any other ways to squeeze it into this memory?
Thanks.
See my answer to #119
@shehrum you need to first train with the attention prior enabled and then disable it and resume training once the attention looks good.
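A sketch of the second stage: set "use_attn_prior": false in config.json, then resume from a stage-1 checkpoint (hypothetical path, using the same -p override style as the commands above):

```
python train.py -c config.json -p train_config.checkpoint_path="outdir/model_100000"
```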
@rafaelvalle do you mean this setting?

Attention prior enabled:

```json
"use_attn_prior": true,
"attn_prior_threshold": 0.0,
"prior_cache_path": "/attention_prior_cache",
```

Then disabled:

```json
"use_attn_prior": false,
"attn_prior_threshold": 0.0,
"prior_cache_path": "/attention_prior_cache",
```

Thanks for the info.
Actually, I tried fine-tuning for few-shot speech synthesis: sid0_sigma0.5_attnlayer0 has a clear attention map, while sid0_sigma0.5_attnlayer1 failed to form a clear attention map.