
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values when training GRU

Open hichiaty opened this issue 6 years ago • 3 comments

Hi,

I recently trained an LSTM-with-attention based model using the following hparams:

python3.6 -m nmt.nmt \
    --attention=luong \
    --src=r --tgt=p \
    --vocab_prefix=/home/hisham/nmt2_data/vocab \
    --train_prefix=/home/hisham/nmt2_data/train \
    --dev_prefix=/home/hisham/nmt2_data/valid \
    --test_prefix=/home/hisham/nmt2_data/test \
    --out_dir=/home/hisham/nmt2_attention_model \
    --num_train_steps=92700 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=1024 \
    --metrics=accuracy \
    --encoder_type=bi \
    --learning_rate=0.355 \
    --decay_scheme=luong5 \
    --num_gpus=2

This trained successfully, even with the modifications I made to the model, which are:

- source and target max lengths set to None
- num_units for the decoder hard-coded to 2048, leaving the encoder with 1024 units
- a modification of the luong5 decay scheme

This worked perfectly with an LSTM unit type, but when I try training the same exact model on the same data with a GRU like this:

python3.6 -m nmt.nmt \
    --attention=luong \
    --src=r --tgt=p \
    --vocab_prefix=/home/hisham/nmt2_data/vocab \
    --train_prefix=/home/hisham/nmt2_data/train \
    --dev_prefix=/home/hisham/nmt2_data/valid \
    --test_prefix=/home/hisham/nmt2_data/test \
    --out_dir=/home/hisham/GRU-luong-model \
    --num_train_steps=92700 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=1024 \
    --metrics=accuracy \
    --encoder_type=bi \
    --learning_rate=0.355 \
    --decay_scheme=luong5 \
    --num_gpus=2 \
    --unit_type=gru

During training the perplexity becomes ludicrously high, and the run then dies with the error Found Inf or NaN global norm.
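For context on where this error comes from: before applying gradients, the training loop computes a single global norm over all gradient tensors and clips against it, so a single Inf or NaN anywhere in the gradients poisons the norm and aborts the step. A minimal pure-Python sketch of that computation (toy lists instead of tensors, hypothetical helper names, not the actual nmt code):

```python
import math

def global_norm(grads):
    # Global norm = sqrt of the sum of squares over every value in every
    # gradient, mirroring tf.linalg.global_norm.
    return math.sqrt(sum(v * v for g in grads for v in g))

def clip_by_global_norm(grads, clip_norm):
    # Sketch of tf.clip_by_global_norm plus the numeric check that
    # produces the error in this issue: if any gradient holds Inf/NaN,
    # the global norm is non-finite and the step fails.
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [[v * scale for v in g] for g in grads], norm

# A healthy gradient is simply rescaled to the clip threshold:
clipped, norm = clip_by_global_norm([[3.0, 4.0]], clip_norm=1.0)
# norm is 5.0; clipped is scaled down to unit global norm
```

The point is that clipping cannot rescue an already-infinite gradient: by the time the norm is Inf, the blow-up has already happened inside the network.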

Anyone know why this happens?

P.S. I also get the same error with LSTM when I implement variational dropout.

Thanks.

hichiaty avatar Nov 07 '18 17:11 hichiaty

I am having the same issue using ppo2 from OpenAI with an MLP network.

I tried increasing the batch size, which delays the error, but it still happens. I also tried decreasing the learning rate, with no real effect.
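That matches the usual divergence pattern: with too large a step size, updates overshoot and the parameters grow without bound until they overflow to Inf, which is what the global-norm check then catches. A tiny illustration on a toy quadratic (not the actual model, just the mechanism):

```python
def gradient_step(x, lr):
    # Gradient descent on f(x) = x**2, whose gradient is 2*x.
    # With lr > 1.0 the update overshoots past the minimum and |x| grows.
    return x - lr * 2 * x

x_stable, x_diverging = 1.0, 1.0
for _ in range(50):
    x_stable = gradient_step(x_stable, lr=0.1)       # shrinks toward 0
    x_diverging = gradient_step(x_diverging, lr=1.5)  # |x| doubles each step
# x_diverging has exploded by many orders of magnitude; in a real network
# this is the precursor of Inf/NaN gradients
```

A smaller learning rate shrinks the per-step blow-up factor, and a larger batch averages out individual bad gradients, which is consistent with the delay you observed; neither fixes an underlying instability, it only postpones it.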

mleonrivas avatar Nov 11 '18 22:11 mleonrivas

I don't know

yunchaosuper avatar Jan 20 '19 12:01 yunchaosuper

I'm having the same issue, any updates here?

puqunyan avatar Mar 06 '19 23:03 puqunyan