
ZeroDivisionError: float division by zero when training the model

Open mataym opened this issue 4 years ago • 3 comments

```
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.691694759794e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1729236899484e-311
[... the scale keeps halving on every skipped step ...]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
[... halves again down to 5e-324 ...]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Traceback (most recent call last):
  File "train.py", line 189, in <module>
    main()
  File "train.py", line 34, in main
    mp.spawn(train_and_eval, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/nur-179/.temp/glow-tts/train.py", line 91, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/nur-179/.temp/glow-tts/train.py", line 115, in train
    scaled_loss.backward()
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,  # 1./scale,
ZeroDivisionError: float division by zero
```
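The two halves of this log are directly connected: apex's dynamic loss scaler halves the scale every time it sees an overflowing gradient, and once the float64 scale underflows past the smallest subnormal to exactly 0.0, `unscale_with_stashed` divides by it and raises the `ZeroDivisionError` above. A minimal sketch of the collapse in plain Python, using a hypothetical starting scale (apex's actual default differs):

```python
# Sketch of how repeated halving drives a float64 loss scale to exactly 0.0.
# 2.0 ** 15 is a hypothetical starting scale, not apex's default.
scale = 2.0 ** 15
halvings = 0
while scale > 0.0:
    scale /= 2.0        # what "reducing loss scale" does on each overflow
    halvings += 1
print(halvings, scale)  # 1090 0.0 -- one halving past the smallest subnormal (5e-324)
# With the scale at exactly 0.0, apex's unscale_with_stashed evaluates
# out_scale / grads_have_scale, i.e. a division by zero.
```

So the exception is a symptom, not the disease: something is producing inf/NaN gradients on every single step, and the scaler runs itself into the ground trying to compensate.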

my base.json file is as follows:

```json
{
  "train": {
    "use_cuda": true,
    "log_interval": 20,
    "seed": 1234,
    "epochs": 10000,
    "learning_rate": 1e0,
    "betas": [0.9, 0.98],
    "eps": 1e-9,
    "warmup_steps": 4000,
    "scheduler": "noam",
    "batch_size": 4,
    "ddi": true,
    "fp16_run": true
  },
  "data": {
    "load_mel_from_disk": false,
    "training_files": "filelists/ljs_audio_text_train_filelist.txt",
    "validation_files": "filelists/ljs_audio_text_val_filelist.txt",
    "text_cleaners": ["transliteration_cleaners"],
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "add_noise": true,
    "add_space": false,
    "cmudict_path": "data/dict"
  },
  "model": {
    "hidden_channels": 192,
    "filter_channels": 768,
    "filter_channels_dp": 256,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "n_blocks_dec": 12,
    "n_layers_enc": 6,
    "n_heads": 2,
    "p_dropout_dec": 0.05,
    "dilation_rate": 1,
    "kernel_size_dec": 5,
    "n_block_layers": 4,
    "n_sqz": 2,
    "prenet": true,
    "mean_only": true,
    "hidden_channels_enc": 192,
    "hidden_channels_dec": 192,
    "window_size": 4
  }
}
```
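The setting most relevant to the crash is `"fp16_run": true`, which routes training through apex AMP and its dynamic loss scaler. A quick way to rule the scaler out (at the cost of speed and memory) is to train in fp32; a small helper sketch, assuming the config lives at `configs/base.json` as in this repo's layout:

```python
# Workaround sketch, not a fix for the root cause: disable mixed precision
# so training runs in fp32 and never touches apex's loss scaler.
# The config path below is an assumption.
import json

CONFIG = "configs/base.json"

with open(CONFIG) as f:
    hps = json.load(f)

hps["train"]["fp16_run"] = False  # fp32 training: slower, but no loss scaling

with open(CONFIG, "w") as f:
    json.dump(hps, f, indent=2)
```

This only avoids the crash; if the overflows come from bad training data, the underlying problem is still there.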

mataym · Sep 19 '20 16:09

@mataym I am getting the same issue. Everything works fine on a different dataset, so I assume it is something in my new one, but I can't pin down what the difference is, so I am not sure that is really the cause. From searching around I gather it probably has to do with apex, but I am not sure what that means or how to fix it. Did you manage to solve the problem on your end? What did it turn out to be?

Zarbuvit · Oct 08 '20 07:10

> @mataym I am getting the same issue. Everything works fine on a different dataset, so I assume it is something in my new one, but I can't pin down what the difference is, so I am not sure that is really the cause. From searching around I gather it probably has to do with apex, but I am not sure what that means or how to fix it. Did you manage to solve the problem on your end? What did it turn out to be?

I solved the problem by removing one wav file that had no sound. I suggest you check the sample length of every wav file in your dataset; the scan sketched below will flag silent or very short files.
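This is only a sketch: the `wavs` directory name and the 16-bit PCM assumption are placeholders, not part of this repo.

```python
# Sketch: flag wav files that are silent or very short.
# Assumes 16-bit PCM audio; the directory path is a placeholder.
import struct
import wave
from pathlib import Path

def scan_wavs(wav_dir="wavs", min_seconds=0.5):
    """Print wav files that are silent or shorter than min_seconds."""
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            rate = wf.getframerate()
            channels = wf.getnchannels()
            frames = wf.readframes(wf.getnframes())
        # interpret the raw bytes as signed 16-bit samples
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
        duration = len(samples) / (channels * rate)
        peak = max((abs(s) for s in samples), default=0)
        if duration < min_seconds:
            print(f"{path}: only {duration:.2f}s long")
        if peak == 0:
            print(f"{path}: completely silent")

if __name__ == "__main__":
    scan_wavs()
```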

mataym · Oct 13 '20 01:10

@mataym thank you! For me it turned out that my txt files and wav files didn't correspond to each other by name. A check like the one sketched below would have caught it.
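For anyone with the same mismatch, a sketch of that check; it assumes the pipe-separated `wav_path|transcript` filelist format used by this repo's LJSpeech filelists:

```python
# Sketch: verify that every wav path referenced in a filelist exists on disk.
# Assumes the "wav_path|transcript" format of the LJSpeech-style filelists.
import os

def check_filelist(list_path):
    with open(list_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            wav_path = line.split("|")[0].strip()
            if not os.path.isfile(wav_path):
                print(f"{list_path}, line {lineno}: missing {wav_path}")

check_filelist("filelists/ljs_audio_text_train_filelist.txt")
check_filelist("filelists/ljs_audio_text_val_filelist.txt")
```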

Zarbuvit · Oct 13 '20 07:10