InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values when training GRU
Hi,
I recently trained an attention-based LSTM model using the following hparams:
python3.6 -m nmt.nmt \
    --attention=luong \
    --src=r --tgt=p \
    --vocab_prefix=/home/hisham/nmt2_data/vocab \
    --train_prefix=/home/hisham/nmt2_data/train \
    --dev_prefix=/home/hisham/nmt2_data/valid \
    --test_prefix=/home/hisham/nmt2_data/test \
    --out_dir=/home/hisham/nmt2_attention_model \
    --num_train_steps=92700 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=1024 \
    --metrics=accuracy \
    --encoder_type=bi \
    --learning_rate=0.355 \
    --decay_scheme=luong5 \
    --num_gpus=2
This trained successfully, even with the modifications I made to the model, which are:
- source and target max lengths set to None
- num_units hard-coded to 2048 for the decoder, leaving the encoder with 1024 units
- a modified luong5 decay scheme (the stock schedule is sketched below)
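For context on that last item: as I read the nmt code, the stock luong5 schedule holds the learning rate constant for the first half of training and then halves it five times over the remaining steps. A rough TF 1.x sketch of that logic (my reading, not a verbatim copy of the repo):

import tensorflow as tf  # TF 1.x

def luong5_learning_rate(base_lr, global_step, num_train_steps):
    # Stock luong5: hold the LR for the first half of training, then
    # halve it 5 times over the remaining steps (staircase decay).
    start_decay_step = num_train_steps // 2
    remain_steps = num_train_steps - start_decay_step
    decay_steps = remain_steps // 5  # one halving per interval
    return tf.cond(
        global_step < start_decay_step,
        lambda: tf.constant(base_lr, tf.float32),
        lambda: tf.train.exponential_decay(
            base_lr, global_step - start_decay_step,
            decay_steps, 0.5, staircase=True))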
This worked perfectly with the LSTM unit type, but when I try to train the exact same model on the same data with a GRU, like this:
python3.6 -m nmt.nmt \
    --attention=luong \
    --src=r --tgt=p \
    --vocab_prefix=/home/hisham/nmt2_data/vocab \
    --train_prefix=/home/hisham/nmt2_data/train \
    --dev_prefix=/home/hisham/nmt2_data/valid \
    --test_prefix=/home/hisham/nmt2_data/test \
    --out_dir=/home/hisham/GRU-luong-model \
    --num_train_steps=92700 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=1024 \
    --metrics=accuracy \
    --encoder_type=bi \
    --learning_rate=0.355 \
    --decay_scheme=luong5 \
    --num_gpus=2 \
    --unit_type=gru
I get the error "Found Inf or NaN global norm", and the reported training perplexity climbs to ludicrously high values before the error occurs.
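Digging into it a bit, the message itself seems to come from the finite check that tf.clip_by_global_norm runs on the global gradient norm (I'm on TF 1.x), so a single Inf/NaN gradient anywhere is enough to trip it. A tiny standalone repro of just that check:

import tensorflow as tf  # TF 1.x

# clip_by_global_norm verifies that the global norm is finite before
# scaling; one Inf/NaN gradient makes the norm non-finite.
grads = [tf.constant([1.0, 2.0]), tf.constant([float("inf"), 0.0])]
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)

with tf.Session() as sess:
    sess.run(global_norm)  # InvalidArgumentError: Found Inf or NaN global norm.

So the real question is why the gradients blow up with the GRU but not the LSTM.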
Anyone know why this happens?
P.S. I also get the same error with LSTM when I implement variational dropout.
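For reference, this is roughly how I wire the variational dropout (TF 1.x DropoutWrapper; the keep probabilities here are just illustrative, not my actual values):

import tensorflow as tf  # TF 1.x

num_units = 1024
cell = tf.nn.rnn_cell.GRUCell(num_units)

# variational_recurrent=True samples one dropout mask per sequence and
# reuses it at every time step; it needs input_size and dtype so the
# input mask can be built ahead of time.
cell = tf.nn.rnn_cell.DropoutWrapper(
    cell,
    input_keep_prob=0.8,  # illustrative
    state_keep_prob=0.8,  # illustrative
    variational_recurrent=True,
    input_size=tf.TensorShape([num_units]),
    dtype=tf.float32)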
Thanks.
I am having the same issue using ppo2 from OpenAI Baselines with an MLP network.
I tried increasing the batch size, which delays the error but it still happens. I also tried decreasing the learning rate, with no real effect.
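If it helps, one way I've narrowed this kind of thing down in TF 1.x graph mode is tf.add_check_numerics_ops(), which makes the run fail at the first op that produces an Inf/NaN rather than much later at the gradient clip. Minimal example:

import tensorflow as tf  # TF 1.x

x = tf.placeholder(tf.float32, [None])
y = tf.log(x)  # log(0.0) -> -inf

# Wraps every floating-point tensor in the current graph with a
# check_numerics op, so the session raises InvalidArgumentError
# naming the first op that produced an Inf/NaN.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run([y, check_op], feed_dict={x: [1.0, 0.0]})

Note it doesn't work on graphs that contain TF control flow (tf.cond/tf.while_loop), so it may not apply cleanly to a full seq2seq graph.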
I don't know.
I'm having the same issue, any updates here?