
transformer_big in T2T 1.15.7 reports an OOM error during multi-GPU training on TensorFlow 2.2.0

Open · Likede15 opened this issue 3 years ago · 4 comments

Description

We customized the translation problem and use our own dictionary. When training with worker_gpu=8, batch_size=1024, and model=transformer_big, an OOM error occurs.

Some of the error messages are as follows:

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Dst tensor is not initialized.
	 [[node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:707) ]]
	 [[add_1/_22391]]
  (1) Internal: Dst tensor is not initialized.
	 [[node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:707) ]]
0 successful operations.
7 derived errors ignored.
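
For context (a note added here, not part of the original report): in TensorFlow, an InternalError of the form "Dst tensor is not initialized" raised during a copy usually means the allocation on the destination GPU failed, i.e. it is an out-of-memory error in disguise; the BFC allocator dump further down in the logs confirms the cards are saturated. Assuming nvidia-smi is available on the host, one way to watch the memory fill up while reproducing:

# Hedged sketch: poll per-GPU memory once a second during trainer warm-up.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv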

Environment information

GPUs: 8 × GeForce RTX 2080 Ti (11 GB memory each)

OS: Ubuntu 16.04

$ pip freeze | grep tensor
tensor2tensor==1.15.7
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.2.0
tensorflow-addons==0.11.2
tensorflow-datasets==4.0.1
tensorflow-estimator==2.2.0
tensorflow-gan==2.0.0
tensorflow-hub==0.9.0
tensorflow-metadata==0.24.0
tensorflow-probability==0.7.0

$ python -V
Python 3.8.5

For bugs: reproduction and error logs

# Steps to reproduce:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

t2t-trainer \
  --problem=translate_src_tgt \
  --model=transformer \
  --t2t_usr_dir=user_dir \
  --hparams_set=transformer_big \
  --train_steps=200000 \
  --eval_steps=100 \
  --data_dir=data \
  --output_dir=model \
  --hparams="batch_size=1024" \
  --worker_gpu=8
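
A possible workaround, sketched here as an assumption rather than a confirmed fix: transformer_big at batch_size=1024 tokens per replica sits close to the ceiling of an 11 GB card, so lowering the per-GPU token budget (batch_size) and/or capping sequence length (max_length, a standard T2T hparam) often keeps the allocator within bounds. The remaining flags mirror the command above:

# Hedged memory-reducing variant (token budget halved, sequence length capped):
t2t-trainer \
  --problem=translate_src_tgt \
  --model=transformer \
  --t2t_usr_dir=user_dir \
  --hparams_set=transformer_big \
  --train_steps=200000 \
  --eval_steps=100 \
  --data_dir=data \
  --output_dir=model \
  --hparams="batch_size=512,max_length=100" \
  --worker_gpu=8
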
# Error logs:
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_gan/python/estimator/tpu_gan_estimator.py:42: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_gan/python/estimator/tpu_gan_estimator.py:42: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
INFO:tensorflow:Importing user module user_dir from path /data/likede/workspace/t2t/translate_source_target
I1029 09:45:03.234566 140162062960448 usr_dir.py:43] Importing user module user_dir from path /data/likede/workspace/t2t/translate_source_target
INFO:tensorflow:Overriding hparams in transformer_big with batch_size=1024
I1029 09:45:03.235566 140162062960448 hparams_lib.py:55] Overriding hparams in transformer_big with batch_size=1024
INFO:tensorflow:Configuring DataParallelism to replicate the model.
I1029 09:45:03.236984 140162062960448 trainer_lib.py:271] Configuring DataParallelism to replicate the model.
INFO:tensorflow:schedule=continuous_train_and_eval
I1029 09:45:03.237099 140162062960448 devices.py:76] schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=8
I1029 09:45:03.237154 140162062960448 devices.py:77] worker_gpu=8
INFO:tensorflow:sync=False
I1029 09:45:03.237202 140162062960448 devices.py:78] sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
W1029 09:45:03.237266 140162062960448 devices.py:141] Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6', 'gpu:7']
I1029 09:45:03.237857 140162062960448 devices.py:170] datashard_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6', 'gpu:7']
INFO:tensorflow:caching_devices: None
I1029 09:45:03.238245 140162062960448 devices.py:171] caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6', 'gpu:7']
I1029 09:45:03.238616 140162062960448 devices.py:172] ps_devices: ['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3', 'gpu:4', 'gpu:5', 'gpu:6', 'gpu:7']
INFO:tensorflow:Using config: {'_model_dir': '/data/likede/workspace/t2t/translate_source_target/model8_big_docker', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
  allow_growth: true
}
allow_soft_placement: true
graph_options {
  optimizer_options {
    global_jit_level: OFF
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f786f27d860>}
I1029 09:45:03.442313 140162062960448 estimator.py:191] Using config: {'_model_dir': '/data/likede/workspace/t2t/translate_source_target/model8_big_docker', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
  allow_growth: true
}
allow_soft_placement: true
graph_options {
  optimizer_options {
    global_jit_level: OFF
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f786f27d860>}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f786f27f730>) includes params argument, but params are not passed to Estimator.
W1029 09:45:03.442584 140162062960448 model_fn.py:617] Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f786f27f730>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
W1029 09:45:03.442720 140162062960448 trainer_lib.py:795] ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Not using Distribute Coordinator.
I1029 09:45:03.444490 140162062960448 estimator_training.py:186] Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
I1029 09:45:03.444776 140162062960448 training.py:612] Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
I1029 09:45:03.445080 140162062960448 training.py:700] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1029 09:45:03.451545 140162062960448 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Reading data files from /data/likede/workspace/t2t/translate_source_target/data/translate_src_tgt-train*
I1029 09:45:03.459867 140162062960448 problem.py:653] Reading data files from /data/likede/workspace/t2t/translate_source_target/data/translate_src_tgt-train*
INFO:tensorflow:partition: 0 num_data_files: 100
I1029 09:45:03.461889 140162062960448 problem.py:679] partition: 0 num_data_files: 100
WARNING:tensorflow:From /data/likede/workspace/tensor2tensor-master/tensor2tensor/data_generators/problem.py:689: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
W1029 09:45:03.465020 140162062960448 deprecation.py:323] From /data/likede/workspace/tensor2tensor-master/tensor2tensor/data_generators/problem.py:689: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From /data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/data_reader.py:276: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
W1029 09:45:03.515387 140162062960448 deprecation.py:323] From /data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/data_reader.py:276: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
WARNING:tensorflow:From /data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/data_reader.py:38: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W1029 09:45:03.829415 140162062960448 deprecation.py:323] From /data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/data_reader.py:38: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
WARNING:tensorflow:From /data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/data_reader.py:234: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W1029 09:45:04.726016 140162062960448 deprecation.py:323] From /data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/data_reader.py:234: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Calling model_fn.
I1029 09:45:05.002332 140162062960448 estimator.py:1169] Calling model_fn.
INFO:tensorflow:Unsetting shared_embedding_and_softmax_weights.
I1029 09:45:05.022548 140162062960448 t2t_model.py:2267] Unsetting shared_embedding_and_softmax_weights.
INFO:tensorflow:Setting T2TModel mode to 'train'
I1029 09:45:05.022719 140162062960448 t2t_model.py:2267] Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
I1029 09:45:06.043050 140162062960448 api.py:348] Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:45:15.544066 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W1029 09:45:16.043759 140162062960448 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:45:18.140442 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:45:18.376040 140162062960448 api.py:348] Building model body
WARNING:tensorflow:From /data/likede/workspace/tensor2tensor-master/tensor2tensor/models/transformer.py:95: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W1029 09:45:20.954322 140162062960448 deprecation.py:506] From /data/likede/workspace/tensor2tensor-master/tensor2tensor/models/transformer.py:95: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:45:38.670051 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:45:43.860683 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:45:43.890336 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:45:43.918007 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:45:48.136068 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:45:48.247980 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:45:48.277388 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:45:48.305373 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:45:52.576835 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:45:52.688709 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:45:52.718088 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:45:52.745728 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:45:56.801454 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:45:56.914181 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:45:56.943332 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:45:56.971235 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:46:01.049882 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:46:01.165864 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:46:01.195513 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:46:01.223802 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:46:05.581874 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:46:05.695366 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:46:05.724762 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:46:05.752890 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:46:09.876226 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
I1029 09:46:09.989799 140162062960448 api.py:348] Transforming feature 'inputs' with symbol_modality_50003_1024.bottom
INFO:tensorflow:Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
I1029 09:46:10.019273 140162062960448 api.py:348] Transforming feature 'targets' with symbol_modality_40003_1024.targets_bottom
INFO:tensorflow:Building model body
I1029 09:46:10.047407 140162062960448 api.py:348] Building model body
INFO:tensorflow:Transforming body output with symbol_modality_40003_1024.top
I1029 09:46:14.447000 140162062960448 api.py:348] Transforming body output with symbol_modality_40003_1024.top
INFO:tensorflow:Base learning rate: 2.000000
I1029 09:46:15.684860 140162062960448 learning_rate.py:29] Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 309449728
I1029 09:46:15.694217 140162062960448 optimize.py:355] Trainable Variables Total size: 309449728
INFO:tensorflow:Non-trainable variables Total size: 5
I1029 09:46:15.694609 140162062960448 optimize.py:355] Non-trainable variables Total size: 5
INFO:tensorflow:Using optimizer adam
I1029 09:46:15.694975 140162062960448 optimize.py:200] Using optimizer adam
INFO:tensorflow:Done calling model_fn.
I1029 09:46:45.754957 140162062960448 estimator.py:1171] Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
I1029 09:46:45.756235 140162062960448 basic_session_run_hooks.py:546] Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I1029 09:46:57.330067 140162062960448 monitored_session.py:246] Graph was finalized.
2020-10-29 09:46:57.330509: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-10-29 09:46:57.349397: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2500000000 Hz
2020-10-29 09:46:57.362218: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f7690000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-29 09:46:57.362255: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-29 09:46:57.365030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-29 09:46:58.380824: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.430933: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.498518: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.565079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.577249: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.592815: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.610921: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.631450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.632869: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f74b8000b20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-29 09:46:58.632895: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632903: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632910: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632915: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632920: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (4): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632926: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (5): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632931: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (6): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.632937: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (7): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-10-29 09:46:58.636277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.637281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:08.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.637358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.638322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: 
pciBusID: 0000:00:09.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.638377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.639339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties: 
pciBusID: 0000:00:0a.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.639391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.640343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties: 
pciBusID: 0000:00:0b.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.640393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.641361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 4 with properties: 
pciBusID: 0000:00:0c.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.641416: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.642375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 5 with properties: 
pciBusID: 0000:00:0d.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.642424: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.643378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 6 with properties: 
pciBusID: 0000:00:0e.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.643425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.644373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 7 with properties: 
pciBusID: 0000:00:0f.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-10-29 09:46:58.644663: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 09:46:58.646339: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-29 09:46:58.648016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-29 09:46:58.648927: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-29 09:46:58.650626: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-29 09:46:58.651417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-29 09:46:58.655083: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-29 09:46:58.655194: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.656233: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.657228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.658252: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.659248: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.660240: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.661228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.662236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.663225: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.664210: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.665196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.666196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.667183: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.668169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.669156: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.670159: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.671111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-10-29 09:46:58.671175: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-29 09:46:58.680526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-29 09:46:58.680550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1 2 3 4 5 6 7 
2020-10-29 09:46:58.680559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N N N N N N N N 
2020-10-29 09:46:58.680564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   N N N N N N N N 
2020-10-29 09:46:58.680568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2:   N N N N N N N N 
2020-10-29 09:46:58.680579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3:   N N N N N N N N 
2020-10-29 09:46:58.680585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 4:   N N N N N N N N 
2020-10-29 09:46:58.680590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 5:   N N N N N N N N 
2020-10-29 09:46:58.680595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 6:   N N N N N N N N 
2020-10-29 09:46:58.680600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 7:   N N N N N N N N 
2020-10-29 09:46:58.680861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.681890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.682891: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.683882: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.684876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.685881: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.686875: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.687865: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.688859: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.689861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10468 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:08.0, compute capability: 7.5)
2020-10-29 09:46:58.690314: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.691385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10468 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:09.0, compute capability: 7.5)
2020-10-29 09:46:58.692294: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.693333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10468 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:0a.0, compute capability: 7.5)
2020-10-29 09:46:58.693639: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.694663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10468 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:0b.0, compute capability: 7.5)
2020-10-29 09:46:58.695576: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.696606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10468 MB memory) -> physical GPU (device: 4, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:0c.0, compute capability: 7.5)
2020-10-29 09:46:58.697506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.698516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10468 MB memory) -> physical GPU (device: 5, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:0d.0, compute capability: 7.5)
2020-10-29 09:46:58.698872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.699911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 10468 MB memory) -> physical GPU (device: 6, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:0e.0, compute capability: 7.5)
2020-10-29 09:46:58.700247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-29 09:46:58.701263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 10468 MB memory) -> physical GPU (device: 7, name: GeForce RTX 2080 Ti, pci bus id: 0000:00:0f.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
I1029 09:47:14.201900 140162062960448 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1029 09:47:15.560581 140162062960448 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I1029 09:47:54.193806 140162062960448 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /data/likede/workspace/t2t/translate_source_target/model8_big_docker/model.ckpt.
I1029 09:47:54.273618 140162062960448 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /data/likede/workspace/t2t/translate_source_target/model8_big_docker/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I1029 09:48:10.543252 140162062960448 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
2020-10-29 09:49:14.093171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-29 09:49:28.190152: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 360 of 512
2020-10-29 09:49:31.743436: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-10-29 09:49:41.935162: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.81MiB (rounded to 2949120)
Current allocation summary follows.
2020-10-29 09:49:41.935272: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2020-10-29 09:49:41.935288: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256): 	Total Chunks: 618, Chunks in use: 617. 154.5KiB allocated for chunks. 154.2KiB in use in bin. 2.7KiB client-requested in use in bin.
2020-10-29 09:49:41.935306: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935313: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1024): 	Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2020-10-29 09:49:41.935321: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2048): 	Total Chunks: 183, Chunks in use: 183. 545.5KiB allocated for chunks. 545.5KiB in use in bin. 509.6KiB client-requested in use in bin.
2020-10-29 09:49:41.935328: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4096): 	Total Chunks: 266, Chunks in use: 266. 1.05MiB allocated for chunks. 1.05MiB in use in bin. 1.03MiB client-requested in use in bin.
2020-10-29 09:49:41.935334: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8192): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935342: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16384): 	Total Chunks: 1, Chunks in use: 1. 16.0KiB allocated for chunks. 16.0KiB in use in bin. 16.0KiB client-requested in use in bin.
2020-10-29 09:49:41.935348: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (32768): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935353: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (65536): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935360: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (131072): 	Total Chunks: 2, Chunks in use: 2. 256.0KiB allocated for chunks. 256.0KiB in use in bin. 256.0KiB client-requested in use in bin.
2020-10-29 09:49:41.935367: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (262144): 	Total Chunks: 104, Chunks in use: 104. 36.61MiB allocated for chunks. 36.61MiB in use in bin. 36.56MiB client-requested in use in bin.
2020-10-29 09:49:41.935374: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (524288): 	Total Chunks: 4, Chunks in use: 4. 2.55MiB allocated for chunks. 2.55MiB in use in bin. 1.41MiB client-requested in use in bin.
2020-10-29 09:49:41.935380: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1048576): 	Total Chunks: 1, Chunks in use: 0. 1.50MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935386: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2097152): 	Total Chunks: 500, Chunks in use: 500. 1.37GiB allocated for chunks. 1.37GiB in use in bin. 1.37GiB client-requested in use in bin.
2020-10-29 09:49:41.935392: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4194304): 	Total Chunks: 649, Chunks in use: 649. 2.54GiB allocated for chunks. 2.54GiB in use in bin. 2.53GiB client-requested in use in bin.
2020-10-29 09:49:41.935399: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8388608): 	Total Chunks: 162, Chunks in use: 162. 1.65GiB allocated for chunks. 1.65GiB in use in bin. 1.65GiB client-requested in use in bin.
2020-10-29 09:49:41.935406: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16777216): 	Total Chunks: 217, Chunks in use: 217. 3.40GiB allocated for chunks. 3.40GiB in use in bin. 3.38GiB client-requested in use in bin.
2020-10-29 09:49:41.935412: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935417: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935427: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (134217728): 	Total Chunks: 8, Chunks in use: 8. 1.22GiB allocated for chunks. 1.22GiB in use in bin. 1.22GiB client-requested in use in bin.
2020-10-29 09:49:41.935433: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-10-29 09:49:41.935439: I tensorflow/core/common_runtime/bfc_allocator.cc:957] Bin for 2.81MiB was 2.00MiB, Chunk State: 
2020-10-29 09:49:41.935444: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 2385998592
2020-10-29 09:49:41.935453: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61ca000000 of size 163852288 next 1710
2020-10-29 09:49:41.935458: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61d3c43000 of size 163852288 next 1711
2020-10-29 09:49:41.935463: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61dd886000 of size 163852288 next 1712
2020-10-29 09:49:41.935468: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61e74c9000 of size 163852288 next 1713
2020-10-29 09:49:41.935475: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f110c000 of size 3072 next 2036
2020-10-29 09:49:41.935480: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f110cc00 of size 3072 next 2037
2020-10-29 09:49:41.935485: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f110d800 of size 3072 next 2038
2020-10-29 09:49:41.935490: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f110e400 of size 2949120 next 2039
2020-10-29 09:49:41.935495: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f13de400 of size 2949120 next 2040
2020-10-29 09:49:41.935500: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f16ae400 of size 2949120 next 2041
2020-10-29 09:49:41.935505: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f197e400 of size 2949120 next 2042
2020-10-29 09:49:41.935509: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f1c4e400 of size 2949120 next 2043
2020-10-29 09:49:41.935514: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f1f1e400 of size 2949120 next 2044
2020-10-29 09:49:41.935519: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f21ee400 of size 3072 next 2045
2020-10-29 09:49:41.935524: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f21ef000 of size 2949120 next 2046
2020-10-29 09:49:41.935529: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 7f61f24bf000 of size 10698752 next 2047
...
2020-10-29 09:49:42.050315: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 9 Chunks of size 10452992 totalling 89.72MiB
2020-10-29 09:49:42.050321: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 9 Chunks of size 10518528 totalling 90.28MiB
2020-10-29 09:49:42.050326: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 9 Chunks of size 10633216 totalling 91.27MiB
2020-10-29 09:49:42.050332: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 11 Chunks of size 10698752 totalling 112.23MiB
2020-10-29 09:49:42.050337: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 9 Chunks of size 10731520 totalling 92.11MiB
2020-10-29 09:49:42.050344: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 9 Chunks of size 10829824 totalling 92.95MiB
2020-10-29 09:49:42.050349: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 9 Chunks of size 10895360 totalling 93.52MiB
2020-10-29 09:49:42.050355: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 24 Chunks of size 12800000 totalling 292.97MiB
2020-10-29 09:49:42.050360: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 6 Chunks of size 12804096 totalling 73.27MiB
2020-10-29 09:49:42.050366: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 12985856 totalling 12.38MiB
2020-10-29 09:49:42.050371: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 14266368 totalling 13.61MiB
2020-10-29 09:49:42.050377: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 15785984 totalling 15.05MiB
2020-10-29 09:49:42.050382: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 215 Chunks of size 16777216 totalling 3.36GiB
2020-10-29 09:49:42.050388: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 19642880 totalling 18.73MiB
2020-10-29 09:49:42.050393: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 25227264 totalling 24.06MiB
2020-10-29 09:49:42.050399: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 8 Chunks of size 163852288 totalling 1.22GiB
2020-10-29 09:49:42.050407: I tensorflow/core/common_runtime/bfc_allocator.cc:1002] Sum Total of in-use chunks: 10.22GiB
2020-10-29 09:49:42.050413: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] total_region_allocated_bytes_: 10976981760 memory_limit_: 10976981811 available bytes: 51 curr_region_allocation_bytes_: 17179869184
2020-10-29 09:49:42.050420: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats: 
Limit:                 10976981811
InUse:                 10975412736
MaxInUse:              10976981760
NumAllocs:                    3011
MaxAllocSize:            204812288

2020-10-29 09:49:42.050529: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****************************************************************************************************
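(Added note, not part of the log.) Reading the stats above: InUse is 10,975,412,736 bytes against a Limit of 10,976,981,811 bytes, leaving roughly 1.5 MiB free, so the rounded 2,949,120-byte (2.81 MiB) request at the top of the dump cannot be served; GPU 0 is fully saturated rather than badly fragmented.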
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Dst tensor is not initialized.
	 [[{{node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub}}]]
	 [[add_1/_22391]]
  (1) Internal: Dst tensor is not initialized.
	 [[{{node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub}}]]
0 successful operations.
7 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t-trainer", line 34, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t_trainer.py", line 418, in main
    execute_schedule(exp)
  File "/data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t_trainer.py", line 371, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/trainer_lib.py", line 468, in continuous_train_and_eval
    self._eval_spec)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 472, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1182, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1518, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 778, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1283, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1384, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1369, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1442, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1200, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Dst tensor is not initialized.
	 [[node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:707) ]]
	 [[add_1/_22391]]
  (1) Internal: Dst tensor is not initialized.
	 [[node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:707) ]]
0 successful operations.
7 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub:
 transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/Mean (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:704)	
 transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/self_attention/layer_postprocess/add (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:908)

Input Source operations connected to node transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub:
 transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/Mean (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:704)	
 transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/self_attention/layer_postprocess/add (defined at data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py:908)

Original stack trace for 'transformer/parallel_3_5/transformer/transformer/body/encoder/layer_3/ffn/layer_prepostprocess/layer_norm/sub':
  File "usr/local/bin/t2t-trainer", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t-trainer", line 34, in <module>
    tf.app.run(main)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t_trainer.py", line 418, in main
    execute_schedule(exp)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/bin/t2t_trainer.py", line 371, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/trainer_lib.py", line 468, in continuous_train_and_eval
    self._eval_spec)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 472, in train_and_evaluate
    return executor.run()
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1182, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1211, in _train_model_default
    self.config)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1170, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/t2t_model.py", line 1421, in wrapping_model_fn
    use_tpu=use_tpu)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/t2t_model.py", line 1486, in estimator_model_fn
    logits, losses_dict = model(features)  # pylint: disable=not-callable
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 547, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 778, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/t2t_model.py", line 325, in call
    sharded_logits, losses = self.model_fn_sharded(sharded_features)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/t2t_model.py", line 365, in model_fn_sharded
    if self.use_body_sharded():
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/t2t_model.py", line 402, in model_fn_sharded
    sharded_logits, sharded_losses = dp(self.model_fn, datashard_to_features)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/expert_utils.py", line 171, in __call__
    for i in range(self.n):
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 372, in for_stmt
    _py_for_stmt(iter_, extra_test, body, None, None)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 401, in _py_for_stmt
    body(target)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 387, in protected_body
    original_body(protected_iter)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/expert_utils.py", line 229, in __call__
    if self._devices[i] != DEFAULT_DEV_STRING:
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/expert_utils.py", line 231, in __call__
    outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/utils/t2t_model.py", line 429, in model_fn
    body_out = self.body(transformed_features)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/models/transformer.py", line 243, in body
    if self.has_input:
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/models/transformer.py", line 246, in body
    encoder_output, encoder_decoder_attention_bias = self.encode(
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/models/transformer.py", line 201, in encode
    self._encoder_function, inputs, target_space, hparams,
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/models/transformer.py", line 103, in transformer_encode
    encoder_output = encoder_function(
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/transformer_layers.py", line 201, in transformer_encoder
    for layer in range(hparams.num_encoder_layers or hparams.num_hidden_layers):
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 372, in for_stmt
    _py_for_stmt(iter_, extra_test, body, None, None)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 401, in _py_for_stmt
    body(target)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 387, in protected_body
    original_body(protected_iter)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/transformer_layers.py", line 244, in transformer_encoder
    y = transformer_ffn_layer(
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 950, in layer_preprocess
    layer_input,
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 904, in layer_prepostprocess
    if sequence == "none":
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 906, in layer_prepostprocess
    for c in sequence:
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 372, in for_stmt
    _py_for_stmt(iter_, extra_test, body, None, None)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 401, in _py_for_stmt
    body(target)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 387, in protected_body
    original_body(protected_iter)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 907, in layer_prepostprocess
    if c == "a":
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 909, in layer_prepostprocess
    elif c == "z":
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 911, in layer_prepostprocess
    elif c == "n":
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 912, in layer_prepostprocess
    x = apply_norm(
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 822, in apply_norm
    if norm_type == "layer":
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 926, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1035, in _py_if_stmt
    return body() if cond else orelse()
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 824, in apply_norm
    x, filters=depth, epsilon=epsilon, layer_collection=layer_collection)
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 727, in layer_norm
    return layer_norm_compute(x, epsilon, scale, bias,
  File "data/likede/workspace/tensor2tensor-master/tensor2tensor/layers/common_layers.py", line 707, in layer_norm_compute
    norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 984, in binary_op_wrapper
    return func(x, y, name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 10103, in sub
    "Sub", x=x, y=y, name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3327, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
    self._traceback = tf_stack.extract_stack()

Likede15 avatar Nov 02 '20 02:11 Likede15

P.S. Single-GPU training is fine with batch_size=1024, but with 8 GPUs, OOM is always reported no matter how small batch_size is set.
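
For context: "Dst tensor is not initialized" during a cross-device copy is usually a symptom of the destination GPU running out of memory. A minimal sketch of the two cases (other flags as in the reproduction command above, omitted here; the smaller batch size is only an assumed example value, not a command actually run):

```shell
# OK on this setup: one GPU, batch_size=1024
CUDA_VISIBLE_DEVICES=0 t2t-trainer --worker_gpu=1 --hparams="batch_size=1024"  # other flags omitted

# Always OOM here: eight GPUs, regardless of batch_size (512 is an assumed example)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 t2t-trainer --worker_gpu=8 --hparams="batch_size=512"  # other flags omitted
```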

Likede15 avatar Nov 02 '20 03:11 Likede15

Hi @Likede15

I am experiencing the same problem. Have you solved it yet?

dinosaxon avatar Feb 04 '21 17:02 dinosaxon

> Hi @Likede15
>
> I am experiencing the same problem. Have you solved it yet?

I changed the TensorFlow version from 2.2 back to 1.15.
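
For anyone else landing here, a sketch of that workaround (the exact pins and steps are assumed, not taken from this thread); note that the TF 1.15 wheels only support Python 3.7 and earlier:

```shell
# Assumed sketch: pin TensorFlow back to 1.15.x alongside T2T 1.15.7.
# The TF 1.15 Linux wheels include GPU support and require Python <= 3.7.
pip uninstall -y tensorflow
pip install "tensorflow==1.15.*" tensor2tensor==1.15.7
```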

Likede15 avatar Feb 05 '21 09:02 Likede15

Hi @Likede15

Thanks for your reply. I will give it a try and will let you know.

dinosaxon avatar Feb 05 '21 13:02 dinosaxon