
TPU failing to reinitialize after eval

Open mikeymezher opened this issue 5 years ago • 9 comments

TPUs running TensorFlow 1.11.0 fail to reinitialize after evaluation during the "continuous_train_and_eval" schedule. I'm not sure whether this is a bug in Tensor2Tensor or in TF's TPU scheduling.

This problem does not occur with TPUs running TensorFlow 1.9.0.

Training still works, but I have to set the schedule to "train" and avoid evaluation entirely (a sketch of the workaround invocation is below).
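
For reference, a minimal sketch of the workaround invocation, assuming the usual t2t-trainer setup (the TPU name and output path are placeholders, and the remaining flags are whatever you were already passing):

t2t-trainer \
  --use_tpu \
  --cloud_tpu_name=$TPU_NAME \
  --output_dir=$OUT_DIR \
  --schedule=train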

Environment information

tensor2tensor==1.10.0
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-metadata==0.9.0
tensorflow-serving-api==1.11.0
tensorflow-transform==0.9.0

Trace:

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-05-18:14:34
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (30) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (30) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2018-11-05-18:15:24
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 5.689253, targets/metrics-routes_packed/accuracy = 0.14732198, targets/metrics-routes_packed/accuracy_per_sequence = 0.0, targets/metrics-routes_packed/accuracy_top5 = 0.31858853, targets/metrics-routes_packed/neg_log_perplexity = -5.773135, targets_position/metrics-routes_packed/accuracy = 0.019026693, targets_position/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_position/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_position/metrics-routes_packed/neg_log_perplexity = -12.372437, targets_segmentation/metrics-routes_packed/accuracy = 0.019026693, targets_segmentation/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_segmentation/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_segmentation/metrics-routes_packed/neg_log_perplexity = -12.372437
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7897573, step = 1000 (88.605 sec)
INFO:tensorflow:global_step/sec: 1.1286
INFO:tensorflow:examples/sec: 9.02881
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: The TPU system has not been initialized.
  [[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompile[NumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]]
  [[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recv[client_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge_1039_...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]]]
INFO:tensorflow:Error recorded from infeed: Step was cancelled by an explicit call to Session::Close().
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 33, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 379, in main
    execute_schedule(exp)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 341, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py", line 391, in continuous_train_and_eval
    self._eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: The TPU system has not been initialized.
  [[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompile[NumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]]
  [[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recv[client_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge_1039_...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]]]

mikeymezher avatar Nov 05 '18 18:11 mikeymezher

We see this error internally as well -- best to work around it for now as you've done; we will try to get to the bottom of this.

afrozenator avatar Nov 05 '18 18:11 afrozenator

Same problem.

BogdanDidenko avatar Nov 13 '18 18:11 BogdanDidenko

This still seems to be a problem in T2T 1.12.0. Is this the case internally as well? @afrozenator

mikeymezher avatar Feb 06 '19 20:02 mikeymezher

Same here: it's able to train, but it fails to reinitialize training after evaluation. Trace:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 7 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2019-02-10-19:23:16
INFO:tensorflow:Saving dict for global step 100: global_step = 100, loss = 7.272978, mae = 2.8202152
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 100: gs://xxxx/model.ckpt-100
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7624865, step = 100
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: Combined status information from 9 operations:

Status code: Failed precondition [9x]
  The TPU system has not been initialized.
  	 [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]] [1x]
  The TPU system has not been initialized.
  	 [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]]
  	 [[{{node cluster/_variable_copy/_980}}]] [3x]
  The TPU system has not been initialized.
  	 [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]]
  	 [[{{node cluster/_variable_copy/_980}}]]
  	 [[{{node cluster/_variable_copy/_980}}]] [5x]
(0 successful operations.)

edumotya avatar Feb 10 '19 21:02 edumotya

Has there been any progress on this issue? I think I might be encountering the same problem, except while using InMemoryEvaluatorHook: https://stackoverflow.com/questions/55071546/using-inmemoryevaluatorhook-with-tpu-throws-exception
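
For anyone landing here from that question, this is roughly the pattern involved; a minimal sketch, assuming a working TF 1.x Estimator setup ("estimator", "train_input_fn", and "eval_input_fn" are placeholders, not code from the post):

import tensorflow as tf

# InMemoryEvaluatorHook runs evaluation inside the training session at a
# fixed cadence. On TPU, the evaluation teardown shuts the TPU system down,
# and the next training step then fails with the FailedPreconditionError
# shown above.
eval_hook = tf.estimator.experimental.InMemoryEvaluatorHook(
    estimator, eval_input_fn, every_n_iter=1000)

estimator.train(train_input_fn, hooks=[eval_hook], max_steps=100000)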

dapatil211 avatar Mar 13 '19 21:03 dapatil211

Same with tensorflow-gpu 1.13.1, t2t@3b38635f12348036ea1e0166857f43a6b971ab07, on a TPU v3, running the "train_and_evaluate" schedule; the same code trains on a GPU without error.

cwbeitel avatar Mar 21 '19 19:03 cwbeitel

Is there any progress on this issue?

emukans avatar Nov 06 '19 20:11 emukans

Getting similar errors with TPU on TF 1.13 while trying to implement a CheckpointSaverHook that does the same thing (a sketch of the pattern is below).
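
A minimal sketch of that pattern, assuming TF 1.x Estimator APIs ("estimator", "train_input_fn", and "eval_input_fn" are placeholders):

import tensorflow as tf

class EvalAfterSaveListener(tf.train.CheckpointSaverListener):
  """Hypothetical listener that evaluates each time a checkpoint is saved."""

  def __init__(self, estimator, eval_input_fn):
    self._estimator = estimator
    self._eval_input_fn = eval_input_fn

  def after_save(self, session, global_step_value):
    # evaluate() tears down its own TPU session on exit; the enclosing
    # training session never re-runs TPU initialization, so the next
    # train step hits "The TPU system has not been initialized."
    self._estimator.evaluate(self._eval_input_fn, steps=100)

estimator.train(
    train_input_fn,
    saving_listeners=[EvalAfterSaveListener(estimator, eval_input_fn)],
    max_steps=100000)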

jpark621 avatar Nov 15 '19 21:11 jpark621

I have the same problem.

At the end of the evaluation session, TPUInfeedOutfeedSessionHook calls a routine to shut down the TPU system. However, since the training session is not re-initialized (initialization is only called once if you run evaluation from a saving listener or hook), the TPU system is never re-initialized. A sketch of the asymmetry is below.
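
To make that concrete, a minimal sketch of the two contrib calls involved, assuming TF 1.x ("my-tpu" and the session plumbing are illustrative, not a drop-in fix for the Estimator path):

import tensorflow as tf
from tensorflow.contrib import tpu
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

resolver = TPUClusterResolver(tpu='my-tpu')  # placeholder TPU name

# The eval session's TPUInfeedOutfeedSessionHook runs tpu.shutdown_system()
# on teardown; training would need the matching tpu.initialize_system() to
# run again before the next step, but the hook only issues it once.
with tf.Session(resolver.master()) as sess:
  sess.run(tpu.initialize_system())
  # ... training steps would go here ...
  sess.run(tpu.shutdown_system())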

wjxgeorge avatar Jul 27 '20 23:07 wjxgeorge