
Training instability (Inf/nan score) with TensorFlow 1.15

Open albertz opened this issue 5 years ago • 12 comments

Training would crash at an early step in the first epoch with a message like:

...
pretrain epoch 1, step 59, cost:ctc 6.582720208647288, cost:output/output_prob 6.08799995325171, error:ctc 0.9999999632127583, error:decision 0.0, error:output/output_prob 0.9766354830935596, loss 10714.482, max_size:bpe 52, max_size:bpe0 51, max_size:data 988, mem_usage:GPU:0 4.7GB, num_seqs 20, 0.493 sec/step, elapsed 0:00:38, exp. remaining 0:07:34, complete 7.91%
pretrain epoch 1, step 60, cost:ctc 6.573515668942264, cost:output/output_prob nan, error:ctc 1.0000000295694917, error:decision 0.0, error:output/output_prob 0.977415296365507, loss nan, max_size:bpe 58, max_size:bpe0 57, max_size:data 1057, mem_usage:GPU:0 4.7GB, num_seqs 18, 0.522 sec/step, elapsed 0:00:39, exp. remaining 0:07:25, complete 8.14%
Model seems broken, got inf or nan score.
Accumulated scores: NumbersDict({'error:ctc': 41763.0, 'cost:ctc': 495062.81970214844, 'error:decision': 0.0, 'loss': nan, 'error:output/output_prob': 41228.0, 'cost:output/output_prob': nan})
Exception Exception('Inf/nan score in step 60.') in step 60.

Multiple people have independently reported this (@critias @patrick-wilken and myself @albertz).

Maybe related:

Upstream report:

Our cases:

  • (@albertz) TensorFlow 1.15.0, CUDA 10.1, libcublas.so.10, libcudnn.so.7 (/usr/local/cuda-10.1_new/lib64/libcublas.so.10, /usr/local/cudnn-10.1-v7.6/lib64/libcudnn.so.7.6.3)

It is not clear at this point whether this is something wrong in RETURNN, a bug in TF, a bug in CUDA, or whether some (unlucky but correct) properties of TF 1.15 simply lead to such instability more often.

It has mostly been reported for RNNs, regardless of the LSTM implementation (NativeLstm2, LSTMBlock and StandardLSTM can all cause this). The same TF 1.15 has so far worked fine for Transformer models.
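
For reference, a hedged sketch of how the LSTM implementation is typically switched in a RETURNN network dict; the layer names, sizes and the "data" source here are illustrative and not taken from the failing setups:

# Illustrative RETURNN config fragment (not the failing setup): the same
# bidirectional layer pair, with the LSTM implementation selected via "unit".
lstm_unit = "nativelstm2"  # alternatives that also showed the problem: "lstmblock", "standardlstm"

network = {
    "lstm0_fw": {"class": "rec", "unit": lstm_unit, "n_out": 1024,
                 "direction": 1, "from": "data"},
    "lstm0_bw": {"class": "rec", "unit": lstm_unit, "n_out": 1024,
                 "direction": -1, "from": "data"},
    # ... further encoder layers would follow here.
}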

So maybe it is just bad luck. It does seem to happen in many cases, though, so it would be good to find a workaround. It might also be a bug (in TF or in RETURNN).

albertz · Jun 03 '20 21:06

For reference, with debug_add_check_numerics_ops = True:

InvalidArgumentError: optimize/gradients/lstm0_bw/rec/NativeLstm2_grad/GradOfNativeLstm2:0 : Tensor had NaN values
         [[node add_check_numerics_ops/optimize/gradients/lstm0_bw/rec/NativeLstm2_grad/GradOfNativeLstm2_check_numerics (defined at /work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Or:

InvalidArgumentError: optimize/gradients/output/rec/att/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3:1 : Tensor had NaN values 
         [[node add_check_numerics_ops/optimize/gradients/output/rec/att/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3_check_numerics (defined at /work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Or:

InvalidArgumentError: optimize/gradients/output/rec/subnet_base/input_layers_moved_out/target_embed_ta_unstack/target_embed_ta_unstack/target_embed_ta_unstack_grad/TensorArrayGrad/TensorArrayGradV3:1 : Tensor had NaN values

Or:

InvalidArgumentError: optimize/gradients/conv1/convolution_grad/Conv2DBackpropFilter:0 : Tensor had NaN values 

albertz · Jun 04 '20 10:06

I just noticed that when I run python3 tests/test_TFEngine.py (i.e. all the tests in there), I sometimes get inf/nan. This is with TF 1.15.2, CUDA 10.0, RTX 2070. It is not deterministic at all; often everything runs through fine. Also, when it occurs in some test (e.g. test_attention_train_then_search) and I then run that test specifically (via python3 tests/test_TFEngine.py test_attention_train_then_search), it does not occur anymore. It seems important to run the whole test suite. Maybe the problem is only triggered that way, when there has already been certain (other) usage of the GPU before. @critias @pavelgolik maybe you can also try to run that? At least it is some way of testing this.

albertz · Jun 05 '20 15:06

python3 tests/test_TFEngine.py test_rec_subnet_train_t3d_simple can reproduce it (sometimes, actually quite rarely, maybe in 2% of the cases, so run it maybe 100 times to get the error), and that is even with TF 1.14, CUDA 10.1. It might be much rarer than with TF 1.15; not really sure... (This is not at i6 but in my own private env.) Edit: If you edit the test and set "num_epochs": 200 in that test, it becomes much more deterministic; I get an inf in >50% of the cases.
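
For convenience, a small sketch (assuming a RETURNN checkout with that test file) that just reruns the flaky test until it fails, since a single run only hits the inf/nan in a small fraction of cases:

# Rerun the non-deterministic test until it fails (illustrative helper, not part of RETURNN).
import subprocess

for i in range(100):
    print("run", i)
    result = subprocess.run(
        ["python3", "tests/test_TFEngine.py", "test_rec_subnet_train_t3d_simple"])
    if result.returncode != 0:
        print("reproduced the failure on run", i)
        break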

albertz · Jun 06 '20 16:06

Some success! I have created a TF standalone test case here (latest version). This test case has some interesting behavior:

  • TF 1.14.0, CUDA 10.1, Python 3.7, i6: Tensor had Inf values after a few seconds
  • TF 1.15.0, CUDA 10.1, Python 3.8, i6: Very strange hiccups in the loss (going up and down)! This should not happen. Getting Inf took a bit longer, but eventually it also got Inf.
  • TF 2.2.0 (with tf.compat.v1.disable_v2_behavior()), CUDA 10.1, Python 3.7, private env, --reset_after_nsteps 100: Tensor had Inf values

I reported the problem upstream: tensorflow #40228

albertz · Jun 06 '20 21:06

Some update: Whenever CheckNumerics catches something TensorArray-related, this might be a false positive. The TensorArray uses a dummy "flow" tensor (for gradient management). The value of this flow tensor is totally irrelevant; it is only used to track the dependency order of other operations w.r.t. the TensorArray. That is why TF does not even initialize this tensor on GPU (see code), and thus the flow tensor can be inf or nan (or anything), which does not matter at all, as the value is ignored (but CheckNumerics might not ignore it).
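
To illustrate the mechanism (a minimal sketch, assuming TF 1.x graph mode; this is not the actual RETURNN graph): the flow tensor is exposed as TensorArray.flow, and a blanket check over all float tensors, as debug_add_check_numerics_ops effectively does, will inspect it as well:

# Minimal sketch of the false-positive mechanism described above (TF 1.x graph mode).
import tensorflow as tf

ta = tf.TensorArray(dtype=tf.float32, size=3)
ta = ta.write(0, tf.constant([1.0, 2.0]))

# ta.flow is a scalar float tensor used only to track op ordering; its numeric
# value is never read, and on GPU it may be left uninitialized (so it can be
# inf/nan). A blanket check_numerics over all float tensors still inspects it:
flow_checked = tf.check_numerics(ta.flow, "TensorArray flow")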

So this means that the test case might not actually reproduce the problem. Also, the check_numerics output of RETURNN definitely reports the wrong tensors. So we need some further debugging on the original failing/unstable RETURNN setup...

Or we further extend the test case. Actually, the strange hiccups (loss going up and down) that I have seen in the test case should not happen.

albertz · Jun 13 '20 13:06

@critias @pavelgolik @manish-kumar-garg maybe you can try with TF 2?

albertz · Jun 13 '20 13:06

Note: I was able to reproduce the problem again, with a slightly adapted test case (current version). This test case now removes the check_numerics calls (which get false positives for TensorArray), but it is still able to reproduce getting nan after a few steps:

step 19, loss: 0.94387317
step 20, loss: 0.93824387
step 21, loss: 0.9327911
step 22, loss: 0.92746085
ERR, loss invalid: nan
step 23, loss: nan
ERR, loss invalid: nan
step 24, loss: nan

It was crucial in the test to use LSTMBlockCell, which is interesting, because the inner matmul inside LSTMBlockCell is the same implementation which we also use for our native LSTM cell. Maybe this way of calling matmul is somehow not correct. Also very interesting: the line energy = tf.where(energy_mask, energy, float("-inf") * tf.ones_like(energy)) is important to trigger the problem; without it, it does not happen. Random hypothesis: only this line puts some nan into GPU memory, which is later released (but not reset), and somehow matmul makes use of that uninitialized GPU memory and now picks up this nan...
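
For context, a hedged sketch of the masking pattern described above, next to a common alternative that avoids writing inf into GPU memory (illustrative only, not the fix adopted in this issue; shapes and names are made up):

# The pattern from the test case: set masked attention energies to -inf before the softmax.
import tensorflow as tf

energy = tf.constant([[0.1, 0.2, 0.3, 0.4],
                      [0.5, 0.6, 0.7, 0.8]])             # [batch, time], made-up values
energy_mask = tf.constant([[True, True, True, False],
                           [True, True, False, False]])  # True = valid position

energy_inf_masked = tf.where(
    energy_mask, energy, float("-inf") * tf.ones_like(energy))

# Alternative: a large negative constant instead of -inf still drives the softmax
# weights of masked positions to ~0, but never materializes inf on the GPU.
energy_const_masked = tf.where(
    energy_mask, energy, tf.fill(tf.shape(energy), -1e30))

att_weights = tf.nn.softmax(energy_const_masked, axis=-1)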

Also, this only happens with TF 1.15 now. Specifically: TF 1.15.0, CUDA 10.1, GeForce GTX 1080 Ti, i6 env. It does not happen with TF 1.14.0 or TF 2.2.0.

albertz · Jun 13 '20 21:06

For reference, with LSTMBlockCell, I also got this inf detection once (but not sure whether it really originates from there):

  (1) Invalid argument: gradients/s/LSTMBlockCell_1_grad/LSTMBlockCellGrad:1 : Tensor had Inf values
         [[node add_check_numerics_ops/gradients/s/LSTMBlockCell_1_grad/LSTMBlockCellGrad_check_numerics_1 (defined at /work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Some code links for reference:

  • Op definitions (REGISTER_OP) of LSTMBlockCell etc here (v1.14.0). E.g. LSTMBlockCellGrad:1 corresponds to .Output("dicfo: T").
  • Kernel definitions (REGISTER_KERNEL_BUILDER) of LSTMBlockCell etc here (v1.14.0). LSTMBlockCellOp calls functor::LSTMBlockCellFprop<Device, T, USE_CUBLAS>. LSTMBlockCellGradOp does not use matmul. It wraps functor::LSTMBlockCellBprop<Device, T, USE_CUBLAS>. It also implements LSTMBlockCellBprop<CPUDevice, T, false /* USE_CUBLAS */> which calls LSTMBlockCellBpropWithEigen.
  • GPU/CUDA implementations, e.g. LSTMBlockCellBprop<GPUDevice, T, true> etc. here (v1.14.0). LSTMBlockCellFpropWithCUDA calls TensorBlasGemm. LSTMBlockCellBprop<GPUDevice, T, true /* USE_CUBLAS */> just calls LSTMBlockCellBpropWithCUDA, and that launches the CUDA kernel lstm_gates_bprop.
  • Blas Gemm header here (v1.14.0). Defines TensorBlasGemm, and TensorBlasGemm<Device, T, true /* USE_CUBLAS */> calls TensorCuBlasGemm.
  • CUDA Blas Gemm call for LSTMBlockCell here (v1.14.0). Defines TensorCuBlasGemm. Calls ctx->op_device_context()->stream()->ThenBlasGemm.
  • Original MatMul op kernel implementation here (v1.15.0). LaunchMatMul<GPUDevice, T, true /* USE_CUBLAS */> does stream->ThenBlasGemm or many other possible variants.
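
As a minimal hedged sketch (TF 1.15, graph mode; shapes and names are illustrative) of the code path these links describe: building gradients through LSTMBlockCell is what adds the LSTMBlockCellGrad ops, whose output :1 ("dicfo", the gradients w.r.t. the gate pre-activations per the op definition linked above) appears in the check above:

# Building an LSTMBlockCell graph and its gradients (illustrative; TF 1.15, graph mode).
import tensorflow as tf

x = tf.random.normal([8, 20, 32])          # [batch, time, features], made-up shape
cell = tf.contrib.rnn.LSTMBlockCell(64)    # fused cell: CUDA kernel + cuBLAS matmul
outputs, _ = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)
loss = tf.reduce_mean(outputs)

# This adds LSTMBlockCellGrad ops to the graph (their output :1 is "dicfo").
grads = tf.gradients(loss, tf.trainable_variables())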

albertz · Jun 14 '20 10:06

Hi, I have a problem when training the setup on LibriSpeech:

pretrain epoch 11, step 628, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9795918166637421, error:decision 0.0, error:output/output_prob 0.9795918166637421, loss nan, max_size:classes 55, max_size:data 1623, mem_usage:GPU:0 3.8GB, num_seqs 12, 0.858 sec/step, elapsed 0:08:40, exp. remaining 0:05:55, complete 59.41%
pretrain epoch 11, step 629, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9792060726322233, error:decision 0.0, error:output/output_prob 0.9792060726322233, loss nan, max_size:classes 59, max_size:data 1709, mem_usage:GPU:0 3.8GB, num_seqs 11, 0.888 sec/step, elapsed 0:08:41, exp. remaining 0:05:54, complete 59.49%

But if I do the same on CPU, there are no problems. I used TF 1.15, CUDA 10.0, Python 3.6 on my own computer. I attach the log and the log for the run on CPU.

valerie94 · Jun 15 '20 14:06

@valerie94 Can you try TF 1.14 and TF 2.2 and report whether you see the same problem?

albertz · Jun 15 '20 15:06

@albertz This problem doesn't occur when I use TF 1.14 and TF 2.2. In my case it only happens with TF 1.15.

valerie94 · Jun 19 '20 13:06

Thanks, good to know. So we can just recommend using TF 1.14 or TF >=2.2.
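
If it helps, a tiny illustrative guard (not part of RETURNN) that warns when the affected TF version is in use:

# Illustrative startup check: warn on the TF version affected in this issue.
import tensorflow as tf

major, minor = (int(v) for v in tf.__version__.split(".")[:2])
if (major, minor) == (1, 15):
    print("Warning: TF 1.15 showed inf/nan training instability (see this issue); "
          "TF 1.14 or TF >= 2.2 are recommended instead.")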

albertz · Jun 19 '20 14:06