GRU + Large `recurrent_dropout` Bug
Keras version: 3.5.0
Backend: TensorFlow 2.17.0
I encountered a strange bug when working with the GRU layer. If you create a simple model with a GRU layer and set recurrent_dropout=0.5, very strange behavior occurs:
- With sequence length 20: Everything works as expected.
- With sequence length 100: The output of the GRU layer during training, with the default `tanh` activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.
- With sequence length 145: The behavior is unstable. I received the following warning:
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f6311231eb0>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/rnn.py", line 419, in <genexpr>
ta.write(ta_index_to_write, out) File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/tensorflow/python/util/tf_should_use.py", line 288, in wrapped
return _add_should_use_warning(fn(*args, **kwargs),
I was unable to reproduce this behavior in Colab; there, either the loss becomes inf, or it behaves similarly to the longer sequence lengths.
- With sequence length 200: It throws an error:
Epoch 1/50
2024-09-21 22:10:35.493005: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0] = 2648522 is not in [0, 25601)
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
Cell In[15], line 1
----> 1 model.fit(
2 dataset, onehot_target,
3 batch_size=128,
4 epochs=50,
5 )
File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
119 filtered_tb = _process_traceback_frames(e.__traceback__)
120 # To get the full stack trace, call:
121 # `keras.config.disable_traceback_filtering()`
--> 122 raise e.with_traceback(filtered_tb) from None
123 finally:
124 del filtered_tb
File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/sparse.py:136, in indexed_slices_union_indices_and_values.<locals>.values_for_union(indices_expanded, indices_count, values)
132 to_union_indices = tf.gather(indices_indices, union_indices)
133 values_with_leading_zeros = tf.concat(
134 [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
135 )
--> 136 return tf.gather(values_with_leading_zeros, to_union_indices)
InvalidArgumentError: {{function_node __wrapped__GatherV2_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2] name:
Key points:
- This issue only occurs with GRU. LSTM works fine.
- The issue occurs with `recurrent_dropout=0.5`. It works fine with smaller `recurrent_dropout` values, such as 0.1.
Irrelevant factors:
- Initialization does not affect the outcome.
- The optimizer only slightly affects the behavior: I observed the errors with `rmsprop`; `adam` did not throw errors but resulted in `loss = nan`.
- Regular `dropout` does not affect the issue.
I have prepared a minimal reproducible example in Colab. Here is the link: https://colab.research.google.com/drive/1msGuYB5E_eg_IIU_YK4cJcWrkEm3o0NL?usp=sharing.
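For reference, a minimal sketch of the kind of setup described above (the layer sizes, vocabulary size, and random data here are illustrative assumptions, not the exact contents of the notebook):

```python
import numpy as np
import keras
from keras import layers

SEQ_LEN = 100        # 20 trains fine; 100/145/200 show the behaviors above
VOCAB = 1000         # illustrative vocabulary size
NUM_CLASSES = 5      # illustrative number of classes

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, 64),
    layers.GRU(64, recurrent_dropout=0.5),        # the problematic setting
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

x = np.random.randint(0, VOCAB, size=(256, SEQ_LEN))
y = keras.utils.to_categorical(np.random.randint(0, NUM_CLASSES, size=256), NUM_CLASSES)
model.fit(x, y, batch_size=128, epochs=1)
```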
Hi @nokados -
Here are the points you mentioned:
- With sequence length 100: here the model is simple and the input sequence length is long. Adding a Dense layer with `tanh` activation after the GRU to increase model complexity will reduce the loss.
- With sequence length 145: here, along with the additional Dense layer, increasing the units of the GRU and Dense layers gives good results in terms of accuracy and loss.
- With sequence length 200: here we need to use the Adam optimizer with a tuned learning_rate and recurrent_dropout=0.2; with the same units and layers used for sequence length 145, this trains properly without errors. Since model complexity is increased with the GRU layer, rmsprop is less adaptive here.
A high recurrent_dropout leads to underfitting, so reducing the recurrent_dropout lets the model learn the patterns of the input data more easily.
The gist shows all the changes mentioned for the different sequence lengths. Let me know if anything more is required!
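A hedged sketch of the kind of changes suggested above (the unit counts and learning rate are assumptions, not the exact values from the gist):

```python
import keras
from keras import layers

SEQ_LEN, VOCAB, NUM_CLASSES = 200, 1000, 5        # illustrative sizes

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, 64),
    layers.GRU(128, recurrent_dropout=0.2),       # lower recurrent_dropout, more units
    layers.Dense(64, activation="tanh"),          # extra Dense layer with tanh
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),   # Adam instead of rmsprop
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```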
This seems more like a workaround than a solution to the original problem. Adding an extra layer with a tanh activation doesn't address the issue but merely "hides" the enormous outputs of the GRU by squashing them back into the range of -1 to 1. However, the problem is that these values should already be in this range after the GRU, since tanh is already built into it; mathematically, it shouldn't produce values like -2.5e25. The same behavior is expected from Keras as well.
@nokados ,
The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.
The tanh activation is not applied to the actual output of GRU, it's applied to intermediate calculations. The output of the GRU can be outside of [-1, 1], there's nothing that prevents that.
If you create a simple model with a GRU layer and set recurrent_dropout=0.5, very strange behavior occurs:
With sequence length 20: Everything works as expected. With sequence length 100: The output of the GRU layer during training [...] produces very large values in the range of ±1e25
What happens is that the recurrent_dropout is applied on intermediate state for each item in the sequence. So with a sequence length of 100, the recurrent_dropout of 0.5 is applied a hundred times. Almost all the state gets dropped, to the point that the math becomes meaningless and the model cannot learn.
To avoid this, you have to adapt the recurrent_dropout to the sequence length. A recurrent_dropout of 0.5 may be fine for a sequence length of 20, but as you found experimentally, with a sequence length of 100, a recurrent_dropout of 0.1 is probably more appropriate.
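As an illustration, a rough sketch of scaling recurrent_dropout down with sequence length; the `suggested_recurrent_dropout` helper and its rule of thumb are hypothetical, not a Keras API:

```python
import keras

def suggested_recurrent_dropout(seq_len, base_rate=0.5, base_len=20):
    """Hypothetical rule of thumb: 0.5 at length ~20, scaled down for longer sequences."""
    return min(base_rate, base_rate * base_len / seq_len)

# e.g. suggested_recurrent_dropout(100) == 0.1
gru = keras.layers.GRU(64, recurrent_dropout=suggested_recurrent_dropout(100))
```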
A. Improper behavior:
- It worked correctly until Keras 3.
- It works well with LSTM
- Let's look at the GRU math:
Update gate:
$$ z_t = \sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z) $$
From 0 to 1.
Reset gate:
$$ r_t = \sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r) $$
From 0 to 1.
Candidate hidden state:
$$ \tilde{h}_t = \tanh(W_h \cdot x_t + U_h \cdot (r_t \odot h_{t-1}) + b_h) $$
From -1 to 1.
New hidden state:
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$
- $1 - z_t$ ranges from 0 to 1.
- $(1 - z_t) \odot h_{t-1}$ ranges from 0 to $h_{t-1}$.
- $z_t \odot \tilde{h}_t$ ranges from -1 to 1.
Correct?
At each recurrent step, the maximum difference between $h_{t-1}$ and $h_t$ is 1. So after 100 steps, $|h_t|$ should be less than $100 + |h_0|$, where $h_0 = 0$. In practice, the values stay in the [-0.1, 0.1] range without recurrent dropout.
This behavior remains the same for the model before fitting, so we can ignore the trainable weights for now.
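A small numerical sanity check of this argument, simulating the recurrence above with random, untrained weights (the shapes and the sequence length of 100 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 100          # arbitrary input size, state size, sequence length

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random, untrained weights named after the equations above.
W_z, W_r, W_h = (rng.normal(size=(d_h, d_in)) for _ in range(3))
U_z, U_r, U_h = (rng.normal(size=(d_h, d_h)) for _ in range(3))
b_z = b_r = b_h = np.zeros(d_h)

h = np.zeros(d_h)                    # h_0 = 0
for _ in range(T):
    x = rng.normal(size=d_in)
    z = sigmoid(W_z @ x + U_z @ h + b_z)
    r = sigmoid(W_r @ x + U_r @ h + b_r)
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h) + b_h)
    h = (1 - z) * h + z * h_tilde    # convex combination of h_{t-1} and the candidate
    assert np.max(np.abs(h)) <= 1.0  # stays in [-1, 1], nowhere near 1e25
```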
Also, consider how recurrent dropout relates to these equations: how is it applied? I am not sure, but my guess is that it is applied inside the tanh, in the $\tilde{h}_t$ calculation, so it shouldn't affect the output limits.
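Under that guess, and assuming standard inverted-dropout rescaling with rate $p$ and mask $m$ (an assumption about the implementation, not something verified against the Keras source), the candidate state would be

$$ \tilde{h}_t = \tanh\big(W_h \cdot x_t + U_h \cdot (r_t \odot m \odot h_{t-1}) + b_h\big), \quad m_i \in \{0, \tfrac{1}{1-p}\} $$

and since the result is still wrapped in tanh, $\tilde{h}_t$ would stay in [-1, 1] regardless of the mask.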
B. Other problems
What about the exceptions? Is it okay that too large a recurrent_dropout causes `indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2]`? Could this be a memory issue?