nmt InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values when training GRU

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values when training GRU

Open hichiaty opened this issue 6 years ago • 3 comments

Hi,

I recently trained an LSTM-with-attention based model using the following hparams:

python3.6 -m nmt.nmt
--attention=luong
--src=r --tgt=p
--vocab_prefix=/home/hisham/nmt2_data/vocab
--train_prefix=/home/hisham/nmt2_data/train
--dev_prefix=/home/hisham/nmt2_data/valid
--test_prefix=/home/hisham/nmt2_data/test
--out_dir=/home/hisham/nmt2_attention_model
--num_train_steps=92700
--steps_per_stats=100
--num_layers=2
--num_units=1024
--metrics=accuracy
--encoder_type=bi
--learning_rate=0.355
--decay_scheme=luong5
--num_gpus=2

This trained successfully - even with the modifications I made to the model which are: source and target max lengths set to None hard coded num_units for the decoder to 2048 - leaving the encoder with 1024 units modification of the luong5 decay scheme.

This worked perfectly with an LSTM unit type, but when I try training the same exact model on the same data with a GRU like this:

python3.6 -m nmt.nmt
--attention=luong
--src=r --tgt=p
--vocab_prefix=/home/hisham/nmt2_data/vocab
--train_prefix=/home/hisham/nmt2_data/train
--dev_prefix=/home/hisham/nmt2_data/valid
--test_prefix=/home/hisham/nmt2_data/test
--out_dir=/home/hisham/GRU-luong-model
--num_train_steps=92700
--steps_per_stats=100
--num_layers=2
--num_units=1024
--metrics=accuracy
--encoder_type=bi
--learning_rate=0.355
--decay_scheme=luong5
--num_gpus=2
--unit_type=gru