
T2T 1.15.7 version with Tensorflow 2.2 - t2t-decoder doesn't run

Open assij opened this issue 4 years ago • 18 comments

Description

When I run the t2t-decoder script (En-De transformer-big) on a model that was trained on 8 GPUs using DistributedMirrorStrategy, I get the following error:

ValueError: Tensor("body/parallel_0/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention:0", shape=(), dtype=string, device=/device:GPU:0) must be from the same graph as Tensor("transformer_hparams:0", shape=(), dtype=string). ...

Environment information

OS: Ubuntu 18.04.4 LTS

$ pip freeze | grep tensor

tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow-addons==0.11.2
tensorflow-datasets==2.1.0
tensorflow-estimator==2.2.0
tensorflow-gan==2.0.0
tensorflow-gpu==2.2.0
tensorflow-hub==0.9.0
tensorflow-metadata==0.23.0
tensorflow-probability==0.7.0

$ python -V
Python 3.6.10 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
Run t2t-decoder from an input file
# Error logs:
INFO:tensorflow:Done calling model_fn.
I0913 15:54:39.852982 140520229324608 estimator.py:1171] Done calling model_fn.
Traceback (most recent call last):
  File "t2t-decoder", line 23, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "t2t-decoder", line 15, in main
    t2t_decoder.main(argv)
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/bin/t2t_decoder.py", line 210, in main
    decode(estimator, hp, decode_hp)
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/bin/t2t_decoder.py", line 99, in decode
    checkpoint_path=FLAGS.checkpoint_path)
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/utils/decoding.py", line 481, in decode_from_file
    for elapsed_time, result in timer(result_iter):
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/utils/decoding.py", line 473, in timer
    item = next(gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 629, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 660, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 232, in finalize
    summary.merge_all)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 297, in get_or_default
    op = default_constructor()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 406, in merge_all
    return merge(summary_ops, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 370, in merge
    with _ops.name_scope(name, 'Merge', inputs):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6284, in __enter__
    g_from_inputs = _get_graph_from_inputs(self._values)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 5921, in _get_graph_from_inputs
    _assert_same_graph(original_graph_element, graph_element)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 5856, in _assert_same_graph
    (item, original_item))
ValueError: Tensor("body/parallel_0/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention:0", shape=(), dtype=string, device=/device:GPU:0) must be from the same graph as Tensor("transformer_hparams:0", shape=(), dtype=string).

assij avatar Sep 14 '20 05:09 assij

Same here. I simply changed

import tensorflow as tf

to

import tensorflow.compat.v1 as tf

neverdoubt avatar Sep 19 '20 12:09 neverdoubt

Same here. I simply changed

import tensorflow as tf

to

import tensorflow.compat.v1 as tf

I tried this but it did not fix it (same issue as OP)

wjm41 avatar Sep 24 '20 10:09 wjm41

@wjm41 I was wondering whether you fixed it or not. I have exactly the same issue here.

baojianzhou avatar Sep 28 '20 18:09 baojianzhou

Haven't been able to fix it yet - looks like it's something to do with the save/loading of the model but I'm not experienced enough with TF to know where to look :(

wjm41 avatar Sep 30 '20 10:09 wjm41

@wjm41 Thanks for replying. I fixed mine by adding the following:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
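
To confirm that the switch actually took effect in a given script, a quick check along these lines may help. This is only a sketch: the helper name v1_graph_mode_status is made up for illustration, and it assumes that eager execution being off is a reasonable proxy for "V2 behavior is disabled".

```python
def v1_graph_mode_status():
    """Report whether graph mode is active after disabling V2 behavior.

    Returns True when eager execution is off, or None when TensorFlow
    is not installed. (Hypothetical helper, not part of tensor2tensor.)
    """
    try:
        import tensorflow.compat.v1 as tf
    except ImportError:
        return None

    # Same call as in the fix above; it should run before any graph is built.
    tf.disable_v2_behavior()

    # With V2 behavior disabled, eager execution is expected to be off.
    return not tf.executing_eagerly()


if __name__ == "__main__":
    print("graph mode active:", v1_graph_mode_status())
```

If this reports anything other than True in the environment that runs t2t-trainer or t2t-decoder, the call is probably not being reached early enough.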

baojianzhou avatar Sep 30 '20 13:09 baojianzhou

@baojianzhou I tried adding that to both t2t-decoder and t2t-trainer which gives me a new error:

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key transformer/body/parallel_0/body/encoder/layer_0/ffn/conv1/bias not found in checkpoint
	 [[node save/RestoreV2_1 (defined at /lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py:630) ]]

wjm41 avatar Sep 30 '20 14:09 wjm41

@wjm41, I believe I got the same error too, if I recall correctly. Have you retrained your model yet?

The reason is that if you load a checkpoint from a model that was trained without tf.disable_v2_behavior(), TensorFlow will somehow still use some V2 features. My solution was simply to retrain the model from scratch. Decoding then finished successfully with the newly trained checkpoint. Hope it helps.
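
Before retraining, it can be worth checking which variable names the checkpoint actually contains and comparing them with the key from the NotFoundError. A rough sketch follows, assuming tf.train.list_variables can read the checkpoint. The helper names and the stored names in the example are made up for illustration; they are not taken from a real checkpoint.

```python
import difflib


def list_checkpoint_keys(checkpoint_path):
    """Return the variable names stored in a checkpoint (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so the matcher below works without it

    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


def closest_keys(missing_key, checkpoint_keys, n=3):
    """Return up to n stored names that most resemble the missing key."""
    return difflib.get_close_matches(missing_key, checkpoint_keys, n=n, cutoff=0.6)


# Hypothetical stored names, for illustration only.
stored = [
    "transformer/body/encoder/layer_0/ffn/conv1/bias",
    "transformer/body/encoder/layer_0/ffn/conv1/kernel",
    "global_step",
]
# Key reported as missing in the error above.
missing = "transformer/body/parallel_0/body/encoder/layer_0/ffn/conv1/bias"

print(closest_keys(missing, stored))
```

With a real checkpoint, pass list_checkpoint_keys(<your model dir>) instead of the hand-written list. If the closest stored names differ from the missing key only by a scope prefix, the graphs built at training and decoding time are naming variables differently, which is consistent with retraining under the same settings making the error go away.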

baojianzhou avatar Sep 30 '20 14:09 baojianzhou

@baojianzhou Yes it's working now! Thanks so much :)

wjm41 avatar Sep 30 '20 15:09 wjm41

@baojianzhou I trained the model again with tf.disable_v2_behavior() in t2t-trainer, but t2t-decoder still has issues. Could you please attach the files you are using, including the training and decoder command lines?

assij avatar Sep 30 '20 15:09 assij

@assij my t2t-trainer looks like this:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensor2tensor.bin import t2t_trainer

import tensorflow.compat.v1 as tf

def main(argv):
  t2t_trainer.main(argv)


if __name__ == "__main__":
  tf.disable_v2_behavior()
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run(main)

and my t2t-decoder looks like this:

"""t2t-decoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensor2tensor.bin import t2t_decoder

import tensorflow.compat.v1 as tf

def main(argv):
  t2t_decoder.main(argv)


if __name__ == "__main__":
  tf.disable_v2_behavior()
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run()

wjm41 avatar Sep 30 '20 16:09 wjm41

@wjm41 Thanks, are you using the t2t-trainer with --optionally_use_dist_strat=True ?

assij avatar Sep 30 '20 17:09 assij

@assij No I wasn't - I got it working for a transformer on a custom PROBLEM, not sure that changing hparams should affect this problem in particular.

wjm41 avatar Sep 30 '20 18:09 wjm41

@wjm41 Are you using t2t tag 1.15.7 as is, with only the above 2 changes? Are you training on multiple GPUs or 1 GPU? I'm working on multiple GPUs. Can you please send the result of pip freeze | grep tensor?

assij avatar Sep 30 '20 18:09 assij

@wjm41 Are you using t2t tag 1.15.7 as is, with only the above 2 changes? Are you training on multiple GPUs or 1 GPU? I'm working on multiple GPUs. Can you please send the result of pip freeze | grep tensor?

I've got exactly the same issue. I've tried the solution mentioned above, but it's still not working. Have you fixed it?

vikingmars avatar Oct 27 '20 09:10 vikingmars

2020-11-11 05:25:43.286162: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
2020-11-11 05:25:43.286801: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key evolved_transformer/body/parallel_0/body/encoder/layer_0/conv_branches/dense_2/bias not found in checkpoint
Traceback (most recent call last):
  File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
	 [[{{node save/RestoreV2}}]]
  (1) Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
	 [[{{node save/RestoreV2}}]]
	 [[save/RestoreV2_1/_13]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Nanamumuhan avatar Nov 11 '20 10:11 Nanamumuhan

@wjm41, I believe I got the same error too, if I recall correctly. Have you retrained your model yet?

The reason is that if you load a checkpoint from a model that was trained without tf.disable_v2_behavior(), TensorFlow will somehow still use some V2 features. My solution was simply to retrain the model from scratch. Decoding then finished successfully with the newly trained checkpoint. Hope it helps.

I retrained the model and added tf.disable_v2_behavior() to t2t-trainer, t2t-decoder, and t2t-translate-all, but I still have the problem:

root error(s) found.
  (0) Not found: Key transformer/body/decoder/layer_0/encdec_attention/multihead_attention/k/kernel not found in checkpoint
	 [[node save/RestoreV2 (defined at /lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
  (1) Not found: Key transformer/body/decoder/layer_0/encdec_attention/multihead_attention/k/kernel not found in checkpoint
	 [[node save/RestoreV2 (defined at /lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
	 [[save/RestoreV2_1/_249]]

Do you know the reason?

DawsenWSH avatar Feb 20 '21 03:02 DawsenWSH

You should install tensor2tensor from GitHub as shown below:

git clone https://github.com/tensorflow/tensor2tensor.git
cd tensor2tensor
pip install .

Then replace t2t-trainer and t2t-decoder with the versions from https://github.com/tensorflow/tensor2tensor/issues/1849#issuecomment-701491229

hashk1 avatar Feb 20 '21 14:02 hashk1

Yes, it's working now! Thanks very much!

DawsenWSH avatar Feb 22 '21 02:02 DawsenWSH