
Update trainer.py

Open · TheMGGdev opened this pull request 1 year ago · 10 comments

Clearing the previous trainer metrics each time a model is compiled also erases the compiled state of other models that are a combination of that model. This causes problems in architectures such as GANs, as explained in the issue:

TheMGGdev avatar Nov 08 '24 17:11 TheMGGdev

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar Nov 08 '24 17:11 google-cla[bot]

The issue is: https://github.com/keras-team/keras/issues/20474

TheMGGdev avatar Nov 08 '24 17:11 TheMGGdev

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 59.94%. Comparing base (8409e18) to head (848fdde).

:exclamation: There is a different number of reports uploaded between BASE (8409e18) and HEAD (848fdde). Click for more details.

HEAD has 6 fewer uploads than BASE:

Flag              BASE (8409e18)  HEAD (848fdde)
keras             4               1
keras-torch       1               0
keras-tensorflow  1               0
keras-jax         1               0
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #20473       +/-   ##
===========================================
- Coverage   82.07%   59.94%   -22.14%     
===========================================
  Files         515      515               
  Lines       47416    47415        -1     
  Branches     7439     7439               
===========================================
- Hits        38919    28421    -10498     
- Misses       6691    17246    +10555     
+ Partials     1806     1748       -58     
Flag              Coverage Δ
keras             59.94% <ø> (-22.00%) :arrow_down:
keras-jax         ?
keras-numpy       59.94% <ø> (-0.03%) :arrow_down:
keras-tensorflow  ?
keras-torch       ?

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Nov 08 '24 17:11 codecov-commenter

Thanks for the PR!

@hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug; however, removing it does not seem to break any test.

If we do remove the line, then we need a test for the use case it was breaking.

fchollet avatar Nov 09 '24 20:11 fchollet

> Thanks for the PR!
>
> @hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug; however, removing it does not seem to break any test.
>
> If we do remove the line, then we need a test for the use case it was breaking.

I don't have any context on this.

This is the PR that added this line to fix some bug: https://github.com/keras-team/keras/pull/20197

cc @james77777778

hertschuh avatar Nov 10 '24 02:11 hertschuh

@fchollet @hertschuh I'm looking into it and will report back soon.

james77777778 avatar Nov 10 '24 08:11 james77777778

@fchollet I believe the change in this PR should be fine. The drawback is that there might be some unused metrics left in the sub-trainers.

Previously, that line was necessary to prevent an error caused by the nested Trainer architecture. Here's a reproducible snippet:

import keras

train_images = keras.random.uniform(shape=(32, 784))
train_labels = keras.random.randint(shape=(32, 1), minval=0, maxval=9)

trainer1 = keras.Sequential(
    [keras.layers.Input(shape=(784,)), keras.layers.Dense(1, activation="relu")]
)
trainer1.compile(loss="mse", metrics=["mse"])  # Not calling `build` for `trainer1`'s `CompileLoss`

# Create another model.
inputs = keras.Input(shape=(784,))
x = trainer1(inputs)
outputs = keras.layers.Dense(10)(x)
trainer2 = keras.Model(inputs=inputs, outputs=outputs)
trainer2.compile(loss="binary_crossentropy", metrics=["accuracy"])
trainer2.fit(
    train_images, keras.utils.to_categorical(train_labels, 10), epochs=2
)
# `fit` might fail because `trainer1`'s `CompileLoss` is not built

However, there is logic that skips the metrics of sub-trainers: https://github.com/keras-team/keras/blob/8deee17166a2f36685a9b51d8ddc6dfee2db1298/keras/src/trainers/trainer.py#L264-L267. This allows us to keep the metrics in the sub-trainer without running into the unbuilt `CompileLoss` issue.

We might want to consider adding a similar test as shown above to prevent future breakages.
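
For example, such a test could look roughly like this (a minimal sketch; the test name is illustrative and this is not the exact test that was merged):

import keras


def test_fit_with_unbuilt_subtrainer_compile_loss():
    x = keras.random.uniform(shape=(32, 784))
    y = keras.random.randint(shape=(32, 1), minval=0, maxval=9)

    # The inner model is compiled but never trained directly, so its
    # `CompileLoss` is never built.
    inner = keras.Sequential(
        [keras.layers.Input(shape=(784,)), keras.layers.Dense(1, activation="relu")]
    )
    inner.compile(loss="mse", metrics=["mse"])

    inputs = keras.Input(shape=(784,))
    outputs = keras.layers.Dense(10)(inner(inputs))
    outer = keras.Model(inputs=inputs, outputs=outputs)
    outer.compile(loss="binary_crossentropy", metrics=["accuracy"])

    # Must not fail on `inner`'s unbuilt `CompileLoss`.
    history = outer.fit(x, keras.utils.to_categorical(y, 10), epochs=1)
    assert "loss" in history.history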

james77777778 avatar Nov 10 '24 15:11 james77777778

Thank you, @james77777778 and @hertschuh !

@TheMGGdev are you able to add a basic, minimal unit test for your use case, so we avoid breaking it in the future?

We should also include a test based on @james77777778's snippet above.

fchollet avatar Nov 10 '24 18:11 fchollet

I don't know if this is exactly what you are asking for. Here is a more simplified example of issue 20474: it raises an error in Keras 3.6, but it works in earlier Keras 3 releases. The issue explains the error in more detail, along with where I think it originates. This is the code with the prints for debugging:

import numpy as np

from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential(
    [
        Input(shape=(100,)),
        Dense(100, activation="sigmoid"),
    ]
)
model_2 = Sequential(
    [
        Input(shape=(100,)),
        Dense(80, activation="sigmoid"),
    ]
)

model_1.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])

### Print for debugging / show the error
print("---Debugging/show the error after compiling model_1 and before compiling combined---")
print("model_1.compiled ->", model_1.compiled)
print("model_1.metrics ->", model_1.metrics)
print("model_1._loss_tracker ->", model_1._loss_tracker)
###

# Use a single shared Input; passing two separate `Input` tensors to
# `Model` would leave the graph disconnected.
combined_input = Input(shape=(100,))
combined = Model(combined_input, model_2(model_1(combined_input)))
combined.compile(loss="binary_crossentropy", optimizer=Adam())

### Print for debugging / show the error
print("---Debugging/show the error after compiling model_1 and combined---")
print("model_1.compiled ->", model_1.compiled)
print("model_1.metrics ->", model_1.metrics)
print("model_1._loss_tracker ->", model_1._loss_tracker)
###

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))

And this is the code without the prints for the test:

import numpy as np

from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential(
    [
        Input(shape=(100,)),
        Dense(100, activation="sigmoid"),
    ]
)
model_2 = Sequential(
    [
        Input(shape=(100,)),
        Dense(80, activation="sigmoid"),
    ]
)

model_1.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])

# Single shared Input, so the combined graph stays connected.
combined_input = Input(shape=(100,))
combined = Model(combined_input, model_2(model_1(combined_input)))
combined.compile(loss="binary_crossentropy", optimizer=Adam())

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))

This is the output in Keras 3.6: [screenshot of the error output attached in the original comment]

TheMGGdev avatar Nov 12 '24 09:11 TheMGGdev

I think the solution for both PRs is to clear state only for the sub-models that don't have `compiled == True`, i.e. the ones that have not already been compiled. This way it should work for both cases; see the sketch below.
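
For illustration, the proposed condition might look roughly like this (a hedged sketch only; `_clear_trainer_metrics` and the traversal shown are hypothetical, not actual Keras internals):

# Hypothetical sketch of the proposed guard inside `Trainer.compile`:
for sublayer in self._flatten_layers(include_self=False):
    # Skip sub-models that were already compiled on their own, so
    # their metrics and compiled state are preserved.
    if getattr(sublayer, "compiled", False):
        continue
    _clear_trainer_metrics(sublayer)  # hypothetical helper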

TheMGGdev avatar Nov 12 '24 09:11 TheMGGdev

Any update on this pull request?

TheMGGdev avatar Nov 27 '24 11:11 TheMGGdev

Are you able to provide a unit test?

fchollet avatar Nov 27 '24 17:11 fchollet

What exactly are you asking for? I guess what I already sent doesn't work, does it?

TheMGGdev avatar Nov 27 '24 17:11 TheMGGdev

The code snippet you posted (the second one) can be turned into an appropriate unit test, I think.

fchollet avatar Nov 27 '24 17:11 fchollet

@TheMGGdev Francois is just asking you to adapt the code above into a unit test: an addition to a *_test.py file that will run automatically against every future code change. That's a general expectation for bug fixes so we don't break ourselves in the future. See this recent PR as an arbitrary example: https://github.com/keras-team/keras/pull/20550

Are you able to add a unit test here?

mattdangerw avatar Nov 27 '24 20:11 mattdangerw
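
For reference, the second snippet above could be adapted into a test roughly like this (a minimal sketch; the test class, method name, and use of `keras.src.testing` follow the repo's test conventions but are illustrative, not the test that was actually merged):

import numpy as np

from keras import layers, models, optimizers
from keras.src import testing


class CompileReuseTest(testing.TestCase):
    def test_recompile_keeps_subtrainer_state(self):
        model_1 = models.Sequential(
            [layers.Input(shape=(100,)), layers.Dense(100, activation="sigmoid")]
        )
        model_2 = models.Sequential(
            [layers.Input(shape=(100,)), layers.Dense(80, activation="sigmoid")]
        )
        model_1.compile(
            loss="binary_crossentropy",
            optimizer=optimizers.Adam(),
            metrics=["accuracy"],
        )

        inp = layers.Input(shape=(100,))
        combined = models.Model(inp, model_2(model_1(inp)))
        combined.compile(loss="binary_crossentropy", optimizer=optimizers.Adam())

        # Both the sub-model and the combined model should still train
        # after the second `compile`.
        model_1.train_on_batch(
            np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100))
        )
        combined.train_on_batch(
            np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80))
        )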

> And this is the code without the prints for the test:

It seems the code snippet breaks in 3.7.0 with the proposed change.

3.7.0
2024-12-03 14:35:41.341850: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 89, in <module>
    multi_compile_test()
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 87, in multi_compile_test
    combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 598, in train_on_batch
    logs = self.train_function(data())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 224, in function
    outputs = one_step_on_data(data)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\.venv\Lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 110, in one_step_on_data
    outputs = self.distribute_strategy.run(step_function, args=(data,))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 56, in train_step
    y_pred = self(x, training=True)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\utils\traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "D:\OSS\Keras\keras\keras\src\ops\function.py", line 179, in _run_through_graph
    output_tensors.append(tensor_dict[id(x)])
                          ~~~~~~~~~~~^^^^^^^
KeyError: Exception encountered when calling Functional.call().

2425813381776

Arguments received by Functional.call():
  • inputs=tf.Tensor(shape=(64, 100), dtype=float64)
  • training=True
  • mask=None

Surya2k1 avatar Dec 03 '24 09:12 Surya2k1

Hi @TheMGGdev Any update on this PR? Please.

gbaned avatar Jan 08 '25 09:01 gbaned

This issue seems to be resolved in Keras 3.8, at least in my case.

yoavram avatar Feb 09 '25 15:02 yoavram

This was addressed as part of https://github.com/keras-team/keras/pull/20602

hertschuh avatar Aug 08 '25 18:08 hertschuh