
ValueError: Outputs of true_fn and false_fn must have the same type: int64, bool

Open yana-xuyan opened this issue 5 years ago • 4 comments

Hi, I met this error when trying to fine-tune on SQuAD following #64. It seems there is something wrong with MirroredStrategy. Part of the log is attached below.

I0712 21:31:33.852550 140123963430272 cross_device_ops.py:646] batch_all_reduce invoked for batches size = 177 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0712 21:31:43.480206 140109609428736 coordinator.py:219] Error reported to Coordinator: Outputs of true_fn and false_fn must have the same type: int64, bool
Traceback (most recent call last):
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 852, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "run_squad_GPU.py", line 1084, in model_fn
    train_op, learning_rate, _ = model_utils.get_train_op(FLAGS, total_loss)
  File "/home/xuyan/mrqa/xlnet-qa/model_utils_GPU.py", line 194, in get_train_op
    list(zip(clipped, variables)), global_step=global_step)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py", line 158, in apply_gradients
    is_overall_finite, true_apply_gradients_fn, gen_control_flow_ops.no_op)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2147, in cond
    (val_x.dtype.name, val_y.dtype.name))
ValueError: Outputs of true_fn and false_fn must have the same type: int64, bool

Traceback (most recent call last):
  File "run_squad_GPU.py", line 1310, in <module>
    tf.app.run()
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_squad_GPU.py", line 1209, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
    self.config))
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 641, in _call_for_each_replica
    return _call_for_each_replica(self._container_strategy(), fn, args, kwargs)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
    coord.join(threads)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 852, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "run_squad_GPU.py", line 1084, in model_fn
    train_op, learning_rate, _ = model_utils.get_train_op(FLAGS, total_loss)
  File "/home/xuyan/mrqa/xlnet-qa/model_utils_GPU.py", line 194, in get_train_op
    list(zip(clipped, variables)), global_step=global_step)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py", line 158, in apply_gradients
    is_overall_finite, true_apply_gradients_fn, gen_control_flow_ops.no_op)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/xuyan/anaconda3/envs/xlnet/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2147, in cond
    (val_x.dtype.name, val_y.dtype.name))
ValueError: Outputs of true_fn and false_fn must have the same type: int64, bool

My TensorFlow version is 1.13.1 and Python is 2.7. Could someone kindly give me some suggestions on how to solve this problem?

yana-xuyan avatar Jul 12 '19 13:07 yana-xuyan
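
(Editorial note, not part of the original thread: in TF 1.x, tf.cond requires true_fn and false_fn to return tensors of the same dtype, and the traceback shows that check firing inside tf.contrib's loss_scale_optimizer. The minimal sketch below, assuming TensorFlow 1.13, reproduces the same ValueError in isolation.)

import tensorflow as tf  # assumes TensorFlow 1.13.x

pred = tf.placeholder(tf.bool, shape=[])

# One branch returns an int64 tensor, the other a bool tensor, so graph
# construction fails with:
# ValueError: Outputs of true_fn and false_fn must have the same type: int64, bool
out = tf.cond(pred,
              lambda: tf.constant(1, dtype=tf.int64),  # true_fn -> int64
              lambda: tf.constant(True))               # false_fn -> bool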

Hi, have you found a solution? I also encountered the same problem when I changed float32 to float16.

qdchenxiaoyan avatar Nov 10 '20 03:11 qdchenxiaoyan

Hi, have you found a solution? I also encountered the same problem when I changed float32 to float16.

Hi, I solved this issue before, but it was a long time ago, so I don't really remember the details. I just remember that I changed the data type of one variable in the code.

yana-xuyan avatar Nov 10 '20 04:11 yana-xuyan
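
(Editorial aside, hedged: the sketch below shows one common shape such a dtype change can take, not the confirmed fix from this thread. The idea is to cast one branch of the offending tf.cond so both branches return the same dtype; is_overall_finite, step_update and skip_update are hypothetical names used only for illustration.)

import tensorflow as tf  # assumes TensorFlow 1.13.x

is_overall_finite = tf.placeholder(tf.bool, shape=[])  # hypothetical predicate
step_update = tf.constant(1, dtype=tf.int64)           # stands in for the int64 branch
skip_update = tf.constant(False)                       # stands in for the bool branch

# Casting the bool branch to int64 gives both branch outputs the same dtype,
# so tf.cond no longer raises the ValueError.
out = tf.cond(is_overall_finite,
              lambda: step_update,
              lambda: tf.cast(skip_update, tf.int64))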

Which variable? Or from what data type to what data type? Can you recall it if you think about it? Thanks a lot.

qdchenxiaoyan avatar Nov 10 '20 07:11 qdchenxiaoyan

Hi, have you found a solution? I also encountered the same problem when I changed float32 to float16.

Hi, I solved this issue before, but it was a long time ago, so I don't really remember the details. I just remember that I changed the data type of one variable in the code.

clipped, gnorm = tf.clip_by_global_norm(
    grads, clip_norm=1.0,
    use_norm=tf.cond(
        all_are_finite,
        lambda: tf.global_norm(grads),
        lambda: tf.constant(1.0)))

Is it here?

qdchenxiaoyan avatar Nov 10 '20 07:11 qdchenxiaoyan
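
(Editorial aside, hedged: the thread ends without a confirmed answer. One way to narrow the mismatch down is to inspect the dtypes of the two candidate branch outputs before wrapping them in tf.cond; grads below is a hypothetical fp16 gradient list used only for illustration.)

import tensorflow as tf  # assumes TensorFlow 1.13.x

grads = [tf.constant([1.0, 2.0], dtype=tf.float16)]  # hypothetical fp16 gradients

true_out = tf.global_norm(grads)  # dtype follows the gradients
false_out = tf.constant(1.0)      # float32 by default

# Print the dtypes at graph-construction time to see whether the two
# branches of the planned tf.cond would actually match.
print(true_out.dtype, false_out.dtype)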