
OOM error

h-karami opened this issue 6 years ago · 2 comments

Dear all, we want to train a Universal Transformer (UT) model on our dataset. During training, an OOM error occurs after some steps. We kept retrying with a decreased batch size, but the error was not resolved, even with a batch size of one. We are using a GeForce 1080 Ti with 11 GB of memory.

For example, this error occurred with a batch size of one:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,16,230102,230102] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[node universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/self_attention/multihead_attention/dot_product_attention/MatMul (defined at /home/user/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensor2tensor/layers/common_attention.py:1464) = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/self_attention/multihead_attention/mul, universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/self_attention/multihead_attention/split_heads_1/transpose)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[{{node universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/decoder/universal_transformer_basic/foldl/while/encdec_attention/multihead_attention/k/Tensordot/Shape/_481}} = _Recv client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_769_u...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
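Note that the failing tensor has shape [1,16,230102,230102], i.e. [batch, heads, length, length]: the self-attention logits for a single example of roughly 230,102 subword tokens. At float32 that one tensor alone needs about 16 × 230102² × 4 bytes ≈ 3.4 TB, so no batch size reduction can make it fit in 11 GB; attention memory grows quadratically with sequence length, and the likely culprit is one (or a few) extremely long training examples. A possible workaround, sketched below on the assumption that the dataset contains such outliers, is to cap the example length via the max_length hparam, which filters over-long examples during training (the 256 and 1024 values and the $DATA_DIR/$PROBLEM/$TRAIN_DIR placeholders are only illustrative):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=universal_transformer \
  --hparams_set=universal_transformer_base \
  --hparams="batch_size=1024,max_length=256" \
  --output_dir=$TRAIN_DIR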

h-karami · Feb 23 '19

Try cleaning your t2t_train directory and restarting.
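For reference, a minimal sketch of that clean-up, assuming the output_dir layout shown in the command later in this thread; deleting the directory removes all checkpoints and event files, so the next run starts from step 0:

# WARNING: this wipes the saved checkpoints and event files for this run.
rm -rf ~/t2t/t2t_train/translate_ende_wmt32k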

leon1208 · Mar 02 '19

Solved by reducing batch_size.

---below is my old reply---

Same issue here. Even when I set batch_size=1, the OOM still happens. I also tried cleaning my t2t_train directory, but that did not help either. Here is my command:

t2t-trainer \
  --data_dir=~/t2t/t2t_data \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="batch_size=2048" \
  --schedule=continuous_train_and_eval \
  --output_dir=~/t2t/t2t_train/translate_ende_wmt32k \
  --train_steps=300000 \
  --worker_gpu=10 \
  --eval_steps=100
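For reference, batch_size for text problems is measured in subword tokens per GPU (with the default batching scheme), so lowering it is the usual first response to OOM. A hedged variant of the command above, where 1024 is only an illustrative value, would be:

t2t-trainer \
  --data_dir=~/t2t/t2t_data \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="batch_size=1024" \
  --schedule=continuous_train_and_eval \
  --output_dir=~/t2t/t2t_train/translate_ende_wmt32k \
  --train_steps=300000 \
  --worker_gpu=10 \
  --eval_steps=100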

shizhediao · Oct 20 '22