Failure running a SavedModel exported from a tf.Module with a Keras model as an instance variable

Open ivansoban opened this issue 1 year ago • 4 comments

I have been advised by the TensorFlow team to post this issue here. I will restate the issue below.

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No, because the sample code produces a core dump.

Source

binary

TensorFlow version

2.17.0

Custom code

No

OS platform and distribution

Linux Ubuntu 22.04.4 LTS

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Saving a tf.Module with tf.saved_model.save, when the module holds a Keras model in an instance variable, produces a SavedModel that fails with FAILED_PRECONDITION when run using saved_model_cli or libtensorflow.

In TensorFlow v2.15.0, the behavior is as expected: the graph execution proceeds without any errors and the expected results are produced.

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

SHAPE = (1, 5)

class TestModel(tf.Module):
    def __init__(self):
        super().__init__()
        self.dense_layer = tf.keras.layers.Dense(10)

    @tf.function(input_signature=[tf.TensorSpec(shape=SHAPE, dtype=tf.float32)])
    def run(self, x):
        return self.dense_layer(x)


module = TestModel()
sample_input = tf.random.normal(SHAPE, dtype=tf.float32)
module.run(sample_input)

np.save('sample_input.npy', sample_input.numpy())
tf.saved_model.save(module, "test_model")

# To reproduce, run the following:
# python test.py && saved_model_cli run --dir test_model --tag_set serve --signature_def serving_default --inputs 'x=sample_input.npy'

Relevant log output

2024-08-01 15:25:35.204057: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-01 15:25:35.261217: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-01 15:25:35.278801: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-01 15:25:35.313892: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-01 15:25:37.270110: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-08-01 15:25:38.898105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4281 MB memory:  -> device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
2024-08-01 15:25:38.898868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 944 MB memory:  -> device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
WARNING:tensorflow:From /home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/tools/saved_model_cli.py:716: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.saved_model.load` instead.
W0801 15:25:38.903731 139977869341120 deprecation.py:50] From /home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/tools/saved_model_cli.py:716: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.saved_model.load` instead.
INFO:tensorflow:Restoring parameters from test_model/variables/variables
I0801 15:25:38.936800 139977869341120 saver.py:1417] Restoring parameters from test_model/variables/variables
2024-08-01 15:25:38.941206: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-08-01 15:25:39.144248: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: FAILED_PRECONDITION: Could not find variable dense/bias. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status error message=Resource localhost/dense/bias/N10tensorflow3VarE does not exist.
	 [[{{function_node __inference_run_106}}{{node dense_1/Add/ReadVariableOp}}]]
2024-08-01 15:25:39.144340: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: FAILED_PRECONDITION: Could not find variable dense/bias. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status error message=Resource localhost/dense/bias/N10tensorflow3VarE does not exist.
	 [[{{function_node __inference_run_106}}{{node dense_1/Add/ReadVariableOp}}]]
	 [[StatefulPartitionedCall/_21]]
2024-08-01 15:25:39.144423: I tensorflow/core/framework/local_rendezvous.cc:423] Local rendezvous recv item cancelled. Key hash: 12615348601576968325
Traceback (most recent call last):
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1401, in _do_call
    return fn(*args)
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1384, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1477, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) FAILED_PRECONDITION: Could not find variable dense/bias. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status error message=Resource localhost/dense/bias/N10tensorflow3VarE does not exist.
	 [[{{function_node __inference_run_106}}{{node dense_1/Add/ReadVariableOp}}]]
	 [[StatefulPartitionedCall/_21]]
  (1) FAILED_PRECONDITION: Could not find variable dense/bias. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status error message=Resource localhost/dense/bias/N10tensorflow3VarE does not exist.
	 [[{{function_node __inference_run_106}}{{node dense_1/Add/ReadVariableOp}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/iantolic-soban/tf_bug/.venv/bin/saved_model_cli", line 8, in <module>
    sys.exit(main())
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/tools/saved_model_cli.py", line 1340, in main
    app.run(smcli_main)
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/tools/saved_model_cli.py", line 1338, in smcli_main
    args.func()
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/tools/saved_model_cli.py", line 1036, in run
    run_saved_model_with_feed_dict(
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/tools/saved_model_cli.py", line 721, in run_saved_model_with_feed_dict
    outputs = sess.run(output_tensor_names_sorted, feed_dict=inputs_feed_dict)
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 971, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1214, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1394, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/iantolic-soban/tf_bug/.venv/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1420, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.FailedPreconditionError: Graph execution error:

2 root error(s) found.
  (0) FAILED_PRECONDITION: Could not find variable dense/bias. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status error message=Resource localhost/dense/bias/N10tensorflow3VarE does not exist.
	 [[{{node dense_1/Add/ReadVariableOp}}]]
	 [[StatefulPartitionedCall/_21]]
  (1) FAILED_PRECONDITION: Could not find variable dense/bias. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status error message=Resource localhost/dense/bias/N10tensorflow3VarE does not exist.
	 [[{{node dense_1/Add/ReadVariableOp}}]]
0 successful operations.
0 derived errors ignored.

ivansoban avatar Aug 07 '24 19:08 ivansoban

@ivansoban ,

This is a consequence of TensorFlow moving to Keras 3. TensorFlow 2.15 and lower use Keras 2 by default. TensorFlow 2.16 and higher use Keras 3 by default. Some background here.

Keras 3 is multi-backend. To be compatible with JAX and Torch, the Model class and Layer class no longer inherit from tf.Module. One consequence is that variables are no longer automatically tracked recursively through models and layers.
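
As a rough illustration of what "tracked recursively" means, here is a toy sketch in plain Python. It is not TensorFlow's or Keras's actual implementation; every class name here (Var, Module, TrackedLayer, UntrackedLayer, Wrapper) is hypothetical and exists only to show why a layer that stops inheriting from the container base class can drop out of variable discovery.

```python
class Var:
    """Stand-in for a variable (hypothetical, not tf.Variable)."""

    def __init__(self, name):
        self.name = name


class Module:
    """Toy container that discovers variables by walking its attributes."""

    def variables(self):
        found = []
        for value in vars(self).values():
            if isinstance(value, Var):
                found.append(value)
            elif isinstance(value, Module):
                # Recursion only continues through Module subclasses.
                found.extend(value.variables())
        return found


class TrackedLayer(Module):
    """Layer that inherits from Module, so the walk descends into it."""

    def __init__(self):
        self.bias = Var("dense/bias")


class UntrackedLayer:
    """Layer that does not inherit from Module, so the walk skips it."""

    def __init__(self):
        self.bias = Var("dense/bias")


class Wrapper(Module):
    """Plays the role of the user's module holding a layer as an attribute."""

    def __init__(self, layer):
        self.layer = layer


tracked = [v.name for v in Wrapper(TrackedLayer()).variables()]
untracked = [v.name for v in Wrapper(UntrackedLayer()).variables()]
print(tracked)    # ['dense/bias']
print(untracked)  # []
```

In this toy, the wrapper only sees the bias when the layer shares the container's base class. The real tracking machinery is more involved, so treat this purely as a mental model.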

To solve this, you have several options:

  • Revert to tf_keras (Keras 2) by setting TF_USE_LEGACY_KERAS=1 on the command line, or by running this in Python before importing tensorflow:
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"
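
Since the ordering matters (the variable has to be set before tensorflow is imported), a small guard can make mistakes visible. The following is a minimal sketch; the helper name enable_legacy_keras is hypothetical and not part of TensorFlow. It probably also assumes the tf_keras package is installed alongside TensorFlow 2.16 or later.

```python
import os
import sys


def enable_legacy_keras() -> None:
    """Request Keras 2 (tf_keras) for this process.

    Hypothetical helper: the name is ours, not a TensorFlow API.
    """
    if "tensorflow" in sys.modules:
        # The variable must be set before tensorflow is imported,
        # so setting it at this point would be too late to rely on.
        raise RuntimeError(
            "tensorflow is already imported; set TF_USE_LEGACY_KERAS earlier"
        )
    os.environ["TF_USE_LEGACY_KERAS"] = "1"


enable_legacy_keras()
# import tensorflow as tf  # import only after the call above
```

Calling the helper at the very top of the entry-point script, before any module that imports tensorflow, keeps the setting in one place.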

hertschuh avatar Sep 11 '24 22:09 hertschuh

Thank you @hertschuh for the information and solutions.

If I understand the implications of the move from Keras 2 to 3 correctly, storing a Keras 3 layer/model as an instance variable within a tf.Module is no longer really feasible if we want to export the tf.Module as a SavedModel. That is, unless we pursue option 3, which does seem like a poor choice.

ivansoban avatar Sep 12 '24 15:09 ivansoban

@ivansoban ,

Correct. Any reason why you cannot use option 2?

hertschuh avatar Sep 12 '24 17:09 hertschuh

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Oct 03 '24 02:10 github-actions[bot]

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.

github-actions[bot] avatar Oct 17 '24 02:10 github-actions[bot]
