OOM error
Dear all, we want to train a Universal Transformer (UT) model on our own dataset. After some training steps, an OOM error occurs. We continued training with a smaller batch size, but the error was not resolved, even with a batch size of one. We are using a GeForce 1080 Ti with 11 GB of memory.
For example, this is the error we get with a batch size of one:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,16,230102,230102] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[node universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/self_attention/multihead_attention/dot_product_attention/MatMul (defined at /home/user/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensor2tensor/layers/common_attention.py:1464) = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/self_attention/multihead_attention/mul, universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/self_attention/multihead_attention/split_heads_1/transpose)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[{{node universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/decoder/universal_transformer_basic/foldl/while/encdec_attention/multihead_attention/k/Tensordot/Shape/_481}} = _Recv client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_769_u...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
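If I'm reading the first traceback correctly, the failing allocation is the attention logits tensor, whose shape is [batch, heads, length, length]; with batch 1 and length 230102 it comes from a single example of roughly 230k subword tokens. A quick back-of-the-envelope check (plain Python, nothing t2t-specific) shows why no batch size can make that tensor fit on an 11 GB card:

# Approximate size of the attention logits tensor from the OOM message above:
# dot_product_attention builds logits of shape [batch, heads, length, length]
# in float32 (4 bytes per element).
batch, heads, length = 1, 16, 230102
logits_bytes = batch * heads * length * length * 4
print(f"{logits_bytes / 2**30:.0f} GiB")  # ~3156 GiB, versus 11 GiB on a 1080 Ti

As for the hint at the end of the traceback: report_tensor_allocations_upon_oom is a field of tf.RunOptions, and with Estimator-based training such as t2t-trainer it has to be attached through a SessionRunHook. A minimal sketch, assuming TF 1.x (the hook name here is made up, and as far as I know t2t-trainer has no flag for this, so it would have to be wired into the Estimator's train call by hand):

import tensorflow as tf

class ReportAllocationsOnOOM(tf.train.SessionRunHook):
    """Attach RunOptions so an OOM error also lists the allocated tensors."""
    def before_run(self, run_context):
        opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
        return tf.train.SessionRunArgs(fetches=None, options=opts)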
Try cleaning your t2t_train directory and restarting.
Solved by reducing batch_size.
--- below is my old reply ---
Same question.
Even when I set batch_size=1, the OOM still happens.
Also, I tried cleaning my t2t_train directory, and it is still not working...
Here is my command:
t2t-trainer \
--data_dir=~/t2t/t2t_data \
--problem=translate_ende_wmt32k \
--model=transformer \
--hparams_set=transformer_base \
--hparams="batch_size=2048" \
--schedule=continuous_train_and_eval \
--output_dir=~/t2t/t2t_train/translate_ende_wmt32k \
--train_steps=300000 \
--worker_gpu=10 \
--eval_steps=100
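One note on the fix, since reducing batch_size is what eventually solved this: the --hparams string is parsed on top of the named --hparams_set, and for text problems like translate_ende_wmt32k my understanding is that batch_size counts subword tokens per batch, not examples. A minimal Python sketch of the same override, assuming the transformer_base set from the command above:

from tensor2tensor.models import transformer

# transformer_base() is the hparams_set named in the command above;
# the --hparams flag string is parsed on top of it, overriding single fields.
hparams = transformer.transformer_base()
hparams.parse("batch_size=1024")   # same effect as --hparams="batch_size=1024"
print(hparams.batch_size)          # 1024 (subword tokens per batch)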