Update trainer.py
Cleaning the previous trainer metrics each time you compile a model also erased the compiled state of other models that are a combination of that model, causing problems in architectures such as GANs. As explained in the issue:
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
The issue is: https://github.com/keras-team/keras/issues/20474
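For context, here is a minimal sketch of the GAN-style composition the description refers to (illustrative only: the layer sizes, losses, and optimizer below are my assumptions, not taken from the issue):

import keras

# Discriminator: compiled on its own so it can be trained directly.
discriminator = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Generator: trained only through the combined model.
generator = keras.Sequential([
    keras.layers.Input(shape=(100,)),
    keras.layers.Dense(784, activation="sigmoid"),
])

# Combined model: generator output routed through the (frozen) discriminator.
discriminator.trainable = False
z = keras.Input(shape=(100,))
gan = keras.Model(z, discriminator(generator(z)))
gan.compile(loss="binary_crossentropy", optimizer="adam")

# With the bug, compiling `gan` also cleared `discriminator`'s compile
# state, so training the discriminator on its own afterwards failed.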
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 59.94%. Comparing base (8409e18) to head (848fdde).
:exclamation: There is a different number of reports uploaded between BASE (8409e18) and HEAD (848fdde). Click for more details.
HEAD has 6 fewer uploads than BASE
| Flag | BASE (8409e18) | HEAD (848fdde) |
|---|---|---|
| keras | 4 | 1 |
| keras-torch | 1 | 0 |
| keras-tensorflow | 1 | 0 |
| keras-jax | 1 | 0 |
Additional details and impacted files
@@ Coverage Diff @@
## master #20473 +/- ##
===========================================
- Coverage 82.07% 59.94% -22.14%
===========================================
Files 515 515
Lines 47416 47415 -1
Branches 7439 7439
===========================================
- Hits 38919 28421 -10498
- Misses 6691 17246 +10555
+ Partials 1806 1748 -58
| Flag | Coverage Δ | |
|---|---|---|
| keras | 59.94% <ø> (-22.00%) | :arrow_down: |
| keras-jax | ? | |
| keras-numpy | 59.94% <ø> (-0.03%) | :arrow_down: |
| keras-tensorflow | ? | |
| keras-torch | ? | |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
Thanks for the PR!
@hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug; however, removing it does not seem to break any test.
If we do remove the line, then we need a test for the use case it was breaking.
I don't have any context on this.
This is the PR that added this line to fix some bug: https://github.com/keras-team/keras/pull/20197
cc @james77777778
@fchollet @hertschuh I'm looking into it and will report back soon.
@fchollet I believe the change in this PR should be fine. The drawback is that there might be some unused metrics in the sub-trainers.
Previously, that line was necessary to prevent an error caused by the nested Trainer architecture.
Here’s a reproducible snippet:
import keras

train_images = keras.random.uniform(shape=(32, 784))
train_labels = keras.random.randint(shape=(32, 1), minval=0, maxval=9)

trainer1 = keras.Sequential(
    [keras.layers.Input(shape=(784,)), keras.layers.Dense(1, activation="relu")]
)
trainer1.compile(loss="mse", metrics=["mse"])  # Not calling `build` for `trainer1`'s `CompileLoss`

# Create another model.
inputs = keras.Input(shape=(784,))
x = trainer1(inputs)
outputs = keras.layers.Dense(10)(x)
trainer2 = keras.Model(inputs=inputs, outputs=outputs)
trainer2.compile(loss="binary_crossentropy", metrics=["accuracy"])

trainer2.fit(train_images, keras.utils.to_categorical(train_labels, 10), epochs=2)
# `fit` might fail because `trainer1`'s `CompileLoss` is not built
However, there is logic that skips the metrics for the sub-trainer: https://github.com/keras-team/keras/blob/8deee17166a2f36685a9b51d8ddc6dfee2db1298/keras/src/trainers/trainer.py#L264-L267 This allows us to keep the metrics in the sub-trainer without encountering the unbuilt issue.
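Roughly, the linked logic amounts to the following when the outer trainer gathers metrics (a paraphrased sketch of the idea, not a verbatim copy of the Keras source; the free-function form is mine):

from keras.src.trainers.trainer import Trainer

def collect_sublayer_metrics(trainer):
    # Paraphrased sketch: sub-layers that are themselves Trainers are
    # skipped, so their compile-related metrics stay with them and the
    # outer trainer never touches their (possibly unbuilt) state.
    metrics = []
    for layer in trainer._flatten_layers(include_self=False):
        if isinstance(layer, Trainer):
            continue
        metrics.extend(layer.metrics)
    return metrics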
We might want to consider adding a test similar to the snippet above to prevent future breakage.
Thank you, @james77777778 and @hertschuh !
@TheMGGdev are you able to add a basic, minimal unit test for your use case, so we avoid breaking it in the future?
We should also include a test based on @james77777778's snippet above.
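A minimal sketch of what that test could look like, adapted from the snippets in this thread (the class name, method name, and placement are my assumptions; it also uses a single shared Input tensor so the combined model's graph is connected):

import numpy as np
from keras import layers, models
from keras.src import testing


class NestedTrainerCompileTest(testing.TestCase):
    def test_compiling_outer_model_keeps_inner_model_compiled(self):
        model_1 = models.Sequential(
            [layers.Input(shape=(10,)), layers.Dense(10, activation="sigmoid")]
        )
        model_1.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

        model_2 = models.Sequential(
            [layers.Input(shape=(10,)), layers.Dense(8, activation="sigmoid")]
        )
        inputs = layers.Input(shape=(10,))
        combined = models.Model(inputs, model_2(model_1(inputs)))
        combined.compile(loss="binary_crossentropy", optimizer="adam")

        # Compiling `combined` must not erase `model_1`'s compiled state.
        self.assertTrue(model_1.compiled)

        # Both models should still be trainable on their own batches.
        x = np.random.normal(size=(4, 10)).astype("float32")
        model_1.train_on_batch(x, np.random.normal(size=(4, 10)).astype("float32"))
        combined.train_on_batch(x, np.random.normal(size=(4, 8)).astype("float32"))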
I don't know if this is exactly what you are asking for. Here is a more simplified example of issue 20474. In Keras 3.6 it raises an error, but in earlier Keras 3 versions it works. In the issue I explain the error in more detail and where I think it comes from. This is the code with the debug prints:
import numpy as np
from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential([
    Input(shape=(100,)),
    Dense(100, activation="sigmoid"),
])
model_2 = Sequential([
    Input(shape=(100,)),
    Dense(80, activation="sigmoid"),
])
model_1.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])

### Prints for debugging/showing the error
print("---Debugging/show the error: after compiling model_1, before compiling combined---")
print("model_1.compiled ->", model_1.compiled)
print("model_1.metrics ->", model_1.metrics)
print("model_1._loss_tracker ->", model_1._loss_tracker)
###

combined = Model(Input(shape=(100,)), model_2(model_1(Input(shape=(100,)))))
combined.compile(loss="binary_crossentropy", optimizer=Adam())

### Prints for debugging/showing the error
print("---Debugging/show the error: after compiling both model_1 and combined---")
print("model_1.compiled ->", model_1.compiled)
print("model_1.metrics ->", model_1.metrics)
print("model_1._loss_tracker ->", model_1._loss_tracker)
###

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))
And this is the code without the prints for the test:
import numpy as np
from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential([
    Input(shape=(100,)),
    Dense(100, activation="sigmoid"),
])
model_2 = Sequential([
    Input(shape=(100,)),
    Dense(80, activation="sigmoid"),
])
model_1.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])
combined = Model(Input(shape=(100,)), model_2(model_1(Input(shape=(100,)))))
combined.compile(loss="binary_crossentropy", optimizer=Adam())
model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))
This is the output in Keras 3.6:
I think the solution for both PRs is to clean only the layers of the models that don't have `compiled == True`, i.e., the ones that have not already been compiled. This way it should work for both cases.
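Sketched out, that proposal would look roughly like this (a hypothetical illustration, not actual Keras code; `reset_trainer_state` is a made-up stand-in for whatever the cleanup line currently does):

from keras.src.trainers.trainer import Trainer

def clean_uncompiled_subtrainers(model):
    # Hypothetical sketch of the proposed guard: only reset state for
    # sub-models that have never been compiled, so an already-compiled
    # sub-model (e.g. a GAN discriminator) keeps its metrics.
    for layer in model._flatten_layers(include_self=False):
        if isinstance(layer, Trainer) and not layer.compiled:
            reset_trainer_state(layer)  # made-up stand-in for the cleanup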
Any update on this pull request?
Are you able to provide a unit test?
What exactly are you asking for? I guess what I already sent doesn't work, does it?
I think the code snippet you posted (the 2nd one) can be turned into an appropriate unit test.
@TheMGGdev Francois is just asking you to adapt the code above into a unit test: an addition to a *_test.py file that will run automatically against every future code change. That's a general expectation for bug fixes, so we don't break ourselves in the future. See this recent PR as an arbitrary example: https://github.com/keras-team/keras/pull/20550
Are you able to add a unit test here?
> And this is the code without the prints for the test:
It seems the code snippet breaks in 3.7.0 with the proposed change.
3.7.0
2024-12-03 14:35:41.341850: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 89, in <module>
    multi_compile_test()
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 87, in multi_compile_test
    combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 598, in train_on_batch
    logs = self.train_function(data())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 224, in function
    outputs = one_step_on_data(data)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\.venv\Lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 110, in one_step_on_data
    outputs = self.distribute_strategy.run(step_function, args=(data,))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 56, in train_step
    y_pred = self(x, training=True)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\utils\traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
    ^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\ops\function.py", line 179, in _run_through_graph
    output_tensors.append(tensor_dict[id(x)])
                          ~~~~~~~~~~~^^^^^^^
KeyError: 'Exception encountered when calling Functional.call().\n\n\x1b[1m2425813381776\x1b[0m\n\nArguments received by Functional.call():\n  • inputs=tf.Tensor(shape=(64, 100), dtype=float64)\n  • training=True\n  • mask=None'
Hi @TheMGGdev, any update on this PR, please?
This issue seems to be resolved in Keras 3.8, at least in my case.
This was addressed as part of https://github.com/keras-team/keras/pull/20602