
Accuracy Drop Across TensorFlow Versions When Using Keras 3 Instead of Keras 2

Open · arthurflor23 opened this issue · 10 comments

Hi,

As mentioned here, a drop in model results was observed in the newer TensorFlow versions that use the new Keras. I also tracked some tests using tensorflow with keras and with tf-keras here.

I'm using a CNN+BLSTM+CTC model for handwriting recognition tests, and I'm only using the following APIs (a minimal sketch of such a stack follows the list):

(tf.)keras.layers.BatchNormalization
(tf.)keras.layers.Bidirectional
(tf.)keras.layers.Conv2D
(tf.)keras.layers.Dense
(tf.)keras.layers.Dropout
(tf.)keras.layers.LSTM
(tf.)keras.layers.MaxPooling2D
(tf.)keras.layers.PReLU
(tf.)keras.Model
(tf.)keras.optimizers.RMSprop
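
For illustration, a minimal sketch of such a stack built from the layers above. This is not the actual model: the input shape, filter counts, and vocabulary size are placeholders, and the CTC loss is omitted for brevity.

import tensorflow as tf

# Placeholder input: (width, height, channels) of a line image; 100-symbol vocabulary assumed.
inputs = tf.keras.Input(shape=(1024, 128, 1))
x = tf.keras.layers.Conv2D(32, (3, 3), padding="same")(inputs)
x = tf.keras.layers.PReLU(shared_axes=[1, 2])(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)

# Collapse the height axis so the width axis becomes the time axis for the BLSTM.
x = tf.keras.layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units=128, dropout=0.5, return_sequences=True))(x)
x = tf.keras.layers.Dropout(rate=0.5)(x)
outputs = tf.keras.layers.Dense(100, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.RMSprop())  # CTC loss omitted in this sketch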

So I expanded the tests and tried to find the root of the problem by looking for differences in the default parameters between keras and tf.keras. Below are the tests I ran.

Package             Version               Flags                  Results                  Messages
tensorflow          2.15.1                                       CER ~4.6 and WER ~12.5   no messages
tensorflow          2.16.2                                       CER ~5.7 and WER ~15.4   PTX messages
tensorflow          2.16.2                TF_USE_LEGACY_KERAS    CER ~4.7 and WER ~12.0   PTX messages
tensorflow          2.17.0rc0                                    CER ~5.8 and WER ~15.6   numa and gpu_timer messages
tensorflow          2.17.0rc0             TF_USE_LEGACY_KERAS    CER ~4.8 and WER ~13.0   numa and gpu_timer messages
tensorflow          2.17.0rc1                                    CER ~5.2 and WER ~14.8   numa and gpu_timer messages
tensorflow          2.17.0rc1             TF_USE_LEGACY_KERAS    CER ~4.8 and WER ~13.2   numa and gpu_timer messages
tensorflow          2.17.0                                       CER ~5.4 and WER ~16.0   numa and gpu_timer messages
tensorflow          2.17.0                TF_USE_LEGACY_KERAS    CER ~4.6 and WER ~12.6   numa and gpu_timer messages
tf-nightly          2.18.0.dev20240710                           CER ~5.6 and WER ~15.8   numa and gpu_timer messages
tf-keras only       2.17.0rc0                                    CER ~4.7 and WER ~12.8   numa and gpu_timer messages
keras only          2.15.0                                       CER ~4.7 and WER ~12.4   numa and gpu_timer messages
keras only          3.4.1                                        CER ~5.6 and WER ~15.4   numa and gpu_timer messages
keras-nightly only  3.4.1.dev2024071503                          CER ~5.5 and WER ~14.5   numa and gpu_timer messages

Recently, I found some implementation differences in the LSTM layer dropout between Keras 3 and Keras 2. So I tested some variations using tensorflow with Keras 3:

# worse results: dropout inside the LSTM layer (Bidirectional wrapper)
[...]
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=128, dropout=0.5, return_sequences=True))
[...]
# worse results: dropout inside the LSTM layer (manual forward/backward directions)
[...]
decoder1 = tf.keras.layers.LSTM(units=128, dropout=0.5, go_backwards=False, return_sequences=True)(decoder)
decoder2 = tf.keras.layers.LSTM(units=128, dropout=0.5, go_backwards=True, return_sequences=True)(decoder)

decoder = tf.keras.layers.Concatenate()([decoder1, decoder2])
[...]
# expected results: dropout applied outside the LSTM layer
[...]
decoder1 = tf.keras.layers.LSTM(units=128, dropout=0.0, go_backwards=False, return_sequences=True)(decoder)
decoder1 = tf.keras.layers.Dropout(rate=0.5)(decoder1)

decoder2 = tf.keras.layers.LSTM(units=128, dropout=0.0, go_backwards=True, return_sequences=True)(decoder)
decoder2 = tf.keras.layers.Dropout(rate=0.5)(decoder2)

decoder = tf.keras.layers.Concatenate()([decoder1, decoder2])
[...]

So, when dropout is applied outside the LSTM layer, the model achieves the expected results.

arthurflor23 avatar Jul 15 '24 17:07 arthurflor23

Thanks for the bug report! This definitely seems plausible; the dropout code for RNNs changed substantially between Keras 2 and 3.

Do you have any minimal reproductions you could share?

mattdangerw avatar Jul 18 '24 02:07 mattdangerw

Do you have any minimal reproductions you could share?

Unfortunately, I can't share any right now.

Just the snippets mentioned above:

tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=128, dropout=0.5, return_sequences=True))

Against

decoder1 = tf.keras.layers.LSTM(units=128, dropout=0.0, go_backwards=False, return_sequences=True)(decoder)
decoder1 = tf.keras.layers.Dropout(rate=0.5)(decoder1)

decoder2 = tf.keras.layers.LSTM(units=128, dropout=0.0, go_backwards=True, return_sequences=True)(decoder)
decoder2 = tf.keras.layers.Dropout(rate=0.5)(decoder2)

decoder = tf.keras.layers.Concatenate()([decoder1, decoder2])

arthurflor23 avatar Jul 18 '24 02:07 arthurflor23

Thanks! I'll see if I can repro with some simple sequence data. Hopefully it's easy enough to tell where the dropout mask is going wrong just by comparing the two.

mattdangerw avatar Jul 18 '24 05:07 mattdangerw

@arthurflor23 actually a little more info would be useful. Can you tell if you are using the GPU fused ops for LSTMs?

https://github.com/keras-team/tf-keras/blob/8eb5627387f83c2ebd31b31aad92f42ad0a8dcdb/tf_keras/layers/rnn/lstm.py#L1007-L1010

https://github.com/keras-team/keras/blob/0ed820f5649bcb27531d73cfc023763712fc8bf9/keras/src/backend/tensorflow/rnn.py#L861-L865

Might need to step in with a debugger, or add a print in the function to see if it's actually getting run.

mattdangerw avatar Jul 19 '24 00:07 mattdangerw

@fchollet it looks to me like these lines are not equivalent to the Keras 2 version, but I am not sure.

https://github.com/keras-team/keras/blob/0ed820f5649bcb27531d73cfc023763712fc8bf9/keras/src/layers/rnn/lstm.py#L523-L529

I see how that's trying to imitate a cached dropout that is the same at each time step. But as far as I can tell, in Keras 2 on GPU the dropout is just computed once for the sequence without this broadcasting trick. Am I reading this wrong?

mattdangerw avatar Jul 19 '24 01:07 mattdangerw

@mattdangerw I think you're right -- Keras 2 was applying input dropout randomly across timesteps (while using a temporally constant mask for recurrent dropout), while Keras 3 is using a temporally constant dropout mask for both input dropout and recurrent dropout. So there is a behavior difference.

The Keras 3 implementation is consistent with Deep Learning with Python, quoting:

the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep. What's more, in order to regularize the representations formed by the recurrent gates of layers such as GRU and LSTM, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time; a temporally random dropout mask would disrupt this error signal and be harmful to the learning process.
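
For illustration only, a small NumPy sketch of the two input-dropout masking schemes being discussed; the shapes and rate are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, features, rate = 2, 5, 4, 0.5

# Keras 2 style input dropout on the GPU path: a fresh mask is drawn for every timestep
# (True marks units that are kept).
mask_per_step = rng.random((batch, timesteps, features)) >= rate

# Keras 3 style: one mask per sample, broadcast so every timestep drops the same units.
mask_constant = np.broadcast_to(rng.random((batch, 1, features)) >= rate,
                                (batch, timesteps, features))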

If you want to replicate the Keras 2 behavior you can simply not pass dropout= to the LSTM layer, and instead add a Dropout layer before the LSTM layer.
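
In code, that suggestion looks roughly like this (a sketch only; the rate, layer sizes, and input shape are just examples):

import tensorflow as tf

x = tf.keras.Input(shape=(100, 256))
h = tf.keras.layers.Dropout(rate=0.5)(x)  # input dropout with a mask that varies across timesteps
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units=128, return_sequences=True))(h)  # no dropout= argument
model = tf.keras.Model(x, h)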

If you think we should change the Keras 3 implementation, let me know.

fchollet avatar Jul 19 '24 18:07 fchollet

@arthurflor23 actually a little more info would be useful. Can you tell if you are using the GPU fused ops for LSTMs?

https://github.com/keras-team/tf-keras/blob/8eb5627387f83c2ebd31b31aad92f42ad0a8dcdb/tf_keras/layers/rnn/lstm.py#L1007-L1010

https://github.com/keras-team/keras/blob/0ed820f5649bcb27531d73cfc023763712fc8bf9/keras/src/backend/tensorflow/rnn.py#L861-L865

Might need to step in with a debugger, or add a print in the function to see if it's actually getting run.

@mattdangerw I added a print to both functions, and both are called during training (keras and tf-keras).

arthurflor23 avatar Jul 22 '24 16:07 arthurflor23

If you want to replicate the Keras 2 behavior you can simply not pass dropout= to the LSTM layer, and instead add a Dropout layer before the LSTM layer.

I couldn't replicate the expected results in Keras 3 using Dropout before the BLSTM, nor before the separate forward and backward LSTM layers. However, somehow using TimeDistributed(Dropout) before the BLSTM/LSTM worked; the tradeoff is the increased computational cost. Does that make any sense?
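
For reference, the TimeDistributed variant differs from a plain Dropout only in the wrapper (a sketch; rate, units, and input shape are just examples):

import tensorflow as tf

x = tf.keras.Input(shape=(100, 256))
h = tf.keras.layers.TimeDistributed(tf.keras.layers.Dropout(rate=0.5))(x)
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units=128, return_sequences=True))(h)
model = tf.keras.Model(x, h)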

Anyway, I understand that the current implementation is correct according to the concept presented above, so let me know what you think, @mattdangerw; if so, I can close the issue.

arthurflor23 avatar Aug 06 '24 10:08 arthurflor23

I finally managed to replicate the same behavior across Keras versions. For those who end up here, there are two key points: apply dropout before each LSTM layer (both forward and backward) and apply a flip operation after the backward call (I didn't know that):

forwards = tf.keras.layers.Dropout(rate=0.5)(decoder)
forwards = tf.keras.layers.LSTM(units=128, return_sequences=True, go_backwards=False)(forwards)

backwards = tf.keras.layers.Dropout(rate=0.5)(decoder)
backwards = tf.keras.layers.LSTM(units=128, return_sequences=True, go_backwards=True)(backwards)
backwards = tf.keras.ops.flip(backwards, axis=1)  # go_backwards=True outputs in reversed time order, so flip back

decoder = tf.keras.layers.Concatenate(axis=-1)([forwards, backwards])

I tried simplifying it by placing a single Dropout layer before the Bidirectional LSTM, but that resulted in different behavior. So it's necessary to include a Dropout before both the forward and the backward LSTM layers.

Here, I implemented a custom Bidirectional layer with an additional dropout parameter that applies dropout before each direction. This preserves both dropout approaches: inside the LSTM layer (as in Keras 3) and before the LSTM layer (as in Keras 2).
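
A rough sketch of that idea follows; it is not the linked implementation, just an illustration of the same pattern as the snippet above, and the class name is made up.

import tensorflow as tf

class BidirectionalWithInputDropout(tf.keras.layers.Layer):
    """Forward and backward LSTMs, each preceded by its own input Dropout."""

    def __init__(self, units, dropout=0.0, **kwargs):
        super().__init__(**kwargs)
        self.forward_dropout = tf.keras.layers.Dropout(dropout)
        self.backward_dropout = tf.keras.layers.Dropout(dropout)
        self.forward_lstm = tf.keras.layers.LSTM(units, return_sequences=True, go_backwards=False)
        self.backward_lstm = tf.keras.layers.LSTM(units, return_sequences=True, go_backwards=True)
        self.concat = tf.keras.layers.Concatenate(axis=-1)

    def call(self, inputs, training=None):
        forwards = self.forward_lstm(self.forward_dropout(inputs, training=training))
        backwards = self.backward_lstm(self.backward_dropout(inputs, training=training))
        backwards = tf.keras.ops.flip(backwards, axis=1)  # restore the original time order
        return self.concat([forwards, backwards])

# Usage (shapes and rate are placeholders):
# decoder = BidirectionalWithInputDropout(units=128, dropout=0.5)(decoder)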

arthurflor23 avatar Aug 09 '24 11:08 arthurflor23

[...] a custom Bidirectional layer with an additional dropout parameter that applies dropout before each direction. This preserves both dropout approaches: inside the LSTM layer (as in Keras 3) and before the LSTM layer (as in Keras 2).

Another suggestion would be to add a noise_shape=None parameter to the LSTM layer, just like the one in the Dropout layer API, allowing flexibility in how the binary dropout mask is shaped.
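
For reference, this is how noise_shape already behaves on the Dropout layer; the suggestion is to expose the same kind of control for the LSTM's input dropout. Shapes and rate here are just examples.

import tensorflow as tf

x = tf.random.uniform((2, 5, 4))  # (batch, timesteps, features)

# Default: an independent mask entry per timestep and feature.
per_step = tf.keras.layers.Dropout(rate=0.5)

# noise_shape with a 1 on the time axis: the same mask is reused at every timestep.
constant_over_time = tf.keras.layers.Dropout(rate=0.5, noise_shape=(None, 1, None))

y1 = per_step(x, training=True)
y2 = constant_over_time(x, training=True)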

arthurflor23 avatar Sep 05 '24 18:09 arthurflor23

Hi @arthurflor23, could you please confirm whether this issue still persists with the latest version of Keras (3.11.3)? If so, could you share a minimal reproducible example? Thanks!

sonali-kumari1 avatar Oct 08 '25 13:10 sonali-kumari1

Hi @sonali-kumari1, I'm currently traveling, but as of the last time I tested, the behavior still persists.

Actually, this is not really an issue, but rather a consequence of how things were implemented in Keras 3. In my view, the current implementation is not really suitable for OCR/HTR.

For example, the dropout applied in Keras 3 does not reproduce the results of the paper https://ieeexplore.ieee.org/abstract/document/6981034. It only works as expected in Keras 2, or with the manual implementation mentioned above.

In any case, this seems more like a design choice by the development team than an actual issue. Feel free to consider this thread concluded.

arthurflor23 avatar Oct 09 '25 20:10 arthurflor23

@arthurflor23 Thanks for the clarification and the reference. I'm closing this issue, as it appears to be a design choice rather than a bug. Feel free to open a new issue if you'd like to revisit this in the future. Thanks!

sonali-kumari1 avatar Oct 13 '25 09:10 sonali-kumari1
