tensor2tensor
T2T 1.15.7 version with Tensorflow 2.2 - t2t-decoder doesn't run
Description
When running the t2t-decoder script (En-De transformer-big) on a model that was trained on 8 GPUs using DistributedMirrorStrategy, I get the following error:
ValueError: Tensor("body/parallel_0/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention:0", shape=(), dtype=string, device=/device:GPU:0) must be from the same graph as Tensor("transformer_hparams:0", shape=(), dtype=string). ...
Environment information
OS: Ubuntu 18.04.4 LTS
$ pip freeze | grep tensor
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow-addons==0.11.2
tensorflow-datasets==2.1.0
tensorflow-estimator==2.2.0
tensorflow-gan==2.0.0
tensorflow-gpu==2.2.0
tensorflow-hub==0.9.0
tensorflow-metadata==0.23.0
tensorflow-probability==0.7.0
$ python -V
Python 3.6.10 :: Anaconda, Inc.
For bugs: reproduction and error logs
# Steps to reproduce:
Run t2t-decoder from an input file
# Error logs:
INFO:tensorflow:Done calling model_fn.
I0913 15:54:39.852982 140520229324608 estimator.py:1171] Done calling model_fn.
Traceback (most recent call last):
File "t2t-decoder", line 23, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "t2t-decoder", line 15, in main
t2t_decoder.main(argv)
File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/bin/t2t_decoder.py", line 210, in main
decode(estimator, hp, decode_hp)
File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/bin/t2t_decoder.py", line 99, in decode
checkpoint_path=FLAGS.checkpoint_path)
File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/utils/decoding.py", line 481, in decode_from_file
for elapsed_time, result in timer(result_iter):
File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/utils/decoding.py", line 473, in timer
item = next(gen)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 629, in predict
hooks=all_hooks) as mon_sess:
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 660, in create_session
self._scaffold.finalize()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 232, in finalize
summary.merge_all)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 297, in get_or_default
op = default_constructor()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 406, in merge_all
return merge(summary_ops, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 370, in merge
with _ops.name_scope(name, 'Merge', inputs):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6284, in __enter__
g_from_inputs = _get_graph_from_inputs(self._values)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 5921, in _get_graph_from_inputs
_assert_same_graph(original_graph_element, graph_element)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 5856, in _assert_same_graph
(item, original_item))
ValueError: Tensor("body/parallel_0/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention:0", shape=(), dtype=string, device=/device:GPU:0) must be from the same graph as Tensor("transformer_hparams:0", shape=(), dtype=string).
Same here. I simply changed
import tensorflow as tf
to
import tensorflow.compat.v1 as tf
I tried this but it did not fix it (same issue as OP)
@wjm41 I was wondering whether you fixed it or not. I have exactly the same issue here.
Haven't been able to fix it yet - looks like it's something to do with the save/loading of the model but I'm not experienced enough with TF to know where to look :(
@wjm41 Thanks for replying. I fixed mine by adding the following:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
@baojianzhou I tried adding that to both t2t-decoder and t2t-trainer, which gives me a new error:
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key transformer/body/parallel_0/body/encoder/layer_0/ffn/conv1/bias not found in checkpoint
[[node save/RestoreV2_1 (defined at /lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py:630) ]]
@wjm41, I believe I got the same error too, if I recall correctly. Have you retrained your model yet?
The reason is that if you load a checkpoint from a model trained without tf.disable_v2_behavior(), TensorFlow will somehow still use some V2 features. My solution was to retrain the model from the beginning. The decoding process finishes successfully with the newly trained checkpoint. Hope it helps.
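For anyone hitting the same "Key ... not found in checkpoint" error, one way to see how the checkpoint's variable names differ from what the graph expects is to list them and look for the nearest match. This is only a rough sketch: closest_checkpoint_keys and load_checkpoint_keys are hypothetical helper names, not part of tensor2tensor, and it assumes tf.train.list_variables can read your checkpoint directory.

```python
import difflib


def closest_checkpoint_keys(missing_key, checkpoint_keys, n=3):
  """Return up to n checkpoint variable names most similar to missing_key."""
  # cutoff=0.0 keeps every candidate, so the best matches are always returned.
  return difflib.get_close_matches(missing_key, checkpoint_keys, n=n, cutoff=0.0)


def load_checkpoint_keys(checkpoint_dir):
  """List variable names stored in a checkpoint (needs TensorFlow installed)."""
  # Imported lazily so closest_checkpoint_keys works without TensorFlow.
  import tensorflow as tf
  # tf.train.list_variables returns (name, shape) tuples.
  return [name for name, _shape in tf.train.list_variables(checkpoint_dir)]


if __name__ == "__main__":
  # Made-up variable names for illustration; in practice use
  # load_checkpoint_keys("/path/to/your/output_dir") instead.
  example_keys = [
      "transformer/body/encoder/layer_0/ffn/conv1/bias",
      "transformer/body/encoder/layer_0/ffn/conv1/kernel",
      "global_step",
  ]
  missing = "transformer/body/parallel_0/body/encoder/layer_0/ffn/conv1/bias"
  for name in closest_checkpoint_keys(missing, example_keys):
    print(name)
```

If the nearest match differs only by a scope such as parallel_0/body, that suggests the checkpoint and the decoding graph were built with different variable scoping.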
@baojianzhou Yes it's working now! Thanks so much :)
@baojianzhou I trained the model again with t2t-trainer calling tf.disable_v2_behavior(), but t2t-decoder still has issues. Can you please attach the files that you are using, including the train command line and the decoder command line?
@assij my t2t-trainer looks like this:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensor2tensor.bin import t2t_trainer
import tensorflow.compat.v1 as tf


def main(argv):
  t2t_trainer.main(argv)


if __name__ == "__main__":
  tf.disable_v2_behavior()
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run(main)
and my t2t-decoder looks like this:
"""t2t-decoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

#import tensorflow.compat.v1 as tf
from tensor2tensor.bin import t2t_decoder
import logging
#import tensorflow as tf
import tensorflow.compat.v1 as tf


def main(argv):
  t2t_decoder.main(argv)


if __name__ == "__main__":
  tf.disable_v2_behavior()
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run()
@wjm41 Thanks, are you using the t2t-trainer with --optionally_use_dist_strat=True ?
@assij No I wasn't - I got it working for a transformer on a custom PROBLEM, not sure that changing hparams should affect this problem in particular.
@wjm41 Are you using t2t tag 1.15.7 as is, with only the above 2 changes? Are you training on multiple GPUs or 1 GPU? I'm working on multiple GPUs. Can you please send the result of pip freeze | grep tensor?
I've got exactly the same issue. I've tried the solution mentioned above, but it's still not working... Have you fixed it?
2020-11-11 05:25:43.286162: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
2020-11-11 05:25:43.286801: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key evolved_transformer/body/parallel_0/body/encoder/layer_0/conv_branches/dense_2/bias not found in checkpoint
Traceback (most recent call last):
File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
[[{{node save/RestoreV2}}]]
(1) Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
[[{{node save/RestoreV2}}]]
[[save/RestoreV2_1/_13]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
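In a long log like this, the missing variable names are easy to lose. Here is a small sketch in plain Python that pulls the unique "Key ... not found in checkpoint" names out of a pasted log. missing_checkpoint_keys is a hypothetical helper name, and the pattern assumes the message wording shown in the log above.

```python
import re

# Matches the variable name in messages such as
# "Key some/scope/kernel not found in checkpoint".
_MISSING_KEY_RE = re.compile(r"Key (\S+) not found in checkpoint")


def missing_checkpoint_keys(log_text):
  """Return the distinct missing variable names, in order of first appearance."""
  seen = set()
  keys = []
  for key in _MISSING_KEY_RE.findall(log_text):
    if key not in seen:
      seen.add(key)
      keys.append(key)
  return keys


if __name__ == "__main__":
  # Shortened, made-up log text for illustration only.
  sample_log = (
      "Not found: Key model/body/decoder/k/kernel not found in checkpoint "
      "Not found: Key model/body/parallel_0/body/encoder/bias not found in checkpoint "
      "(0) Not found: Key model/body/decoder/k/kernel not found in checkpoint"
  )
  for name in missing_checkpoint_keys(sample_log):
    print(name)
```

Run on the log above, this gives two distinct names: one under body/decoder and one under body/parallel_0/body/encoder.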
I retrained the model and added tf.disable_v2_behavior() to t2t-trainer, t2t-decoder, and t2t-translate-all, but I still have the problem:
root error(s) found.
(0) Not found: Key transformer/body/decoder/layer_0/encdec_attention/multihead_attention/k/kernel not found in checkpoint
[[node save/RestoreV2 (defined at /lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
(1) Not found: Key transformer/body/decoder/layer_0/encdec_attention/multihead_attention/k/kernel not found in checkpoint
[[node save/RestoreV2 (defined at /lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
[[save/RestoreV2_1/_249]]
Do you know the reason?
You should install tensor2tensor from GitHub as shown below:
git clone https://github.com/tensorflow/tensor2tensor.git
cd tensor2tensor
pip install .
Then replace t2t-trainer and t2t-decoder with the versions from https://github.com/tensorflow/tensor2tensor/issues/1849#issuecomment-701491229
Yes, it's working now! Thanks very much!