
OOM during training

Open mingchen62 opened this issue 6 years ago • 7 comments

Got an OOM error when running training:

Environment: TF 1.4, GPU: Titan X, Python 2.7, Ubuntu 16.04

Error:

2018-01-07 22:12:42.933166: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[34560,1]
Traceback (most recent call last):
  File "train.py", line 61, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "train.py", line 57, in main
    model.train(config, train_set, val_set, lr_schedule)
  File "/home/hope/im2latex-1/model/base.py", line 160, in train
    lr_schedule)
  File "/home/hope/im2latex-1/model/img2seq.py", line 173, in _run_epoch
    feed_dict=fd)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[34560,1]
  [[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
  [[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'attn_cell/rnn/while/rnn/att_mechanism/MatMul', defined at:
  File "train.py", line 61, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "train.py", line 56, in main
    model.build_train(config)
  File "/home/hope/im2latex-1/model/img2seq.py", line 41, in build_train
    self._add_pred_op()
  File "/home/hope/im2latex-1/model/img2seq.py", line 119, in _add_pred_op
    self.dropout)
  File "/home/hope/im2latex-1/model/decoder.py", line 60, in __call__
    initial_state=attn_cell.initial_state())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 614, in dynamic_rnn
    dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 777, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2640, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2590, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 762, in _time_step
    (output, new_state) = call_cell()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 748, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/home/hope/im2latex-1/model/components/attention_cell.py", line 109, in __call__
    new_output, new_state = self.step(inputs, state)
  File "/home/hope/im2latex-1/model/components/attention_cell.py", line 79, in step
    c = self._attention_mechanism.context(new_h)
  File "/home/hope/im2latex-1/model/components/attention_mechanism.py", line 83, in context
    e = tf.matmul(att_flat, att_beta)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1898, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2437, in _mat_mul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2960, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1473, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[34560,1]
  [[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
  [[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

mingchen62 avatar Jan 08 '18 03:01 mingchen62

To add more info:

The output of "nvidia-smi" at the time of the OOM: all of the 48 GB total GPU memory was in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:05:00.0  On |                  N/A |
| 45%   73C    P2    66W / 250W |  11762MiB / 12188MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:06:00.0 Off |                  N/A |
| 23%   33C    P8    16W / 250W |  11588MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   31C    P8    16W / 250W |  11588MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   20C    P8    15W / 250W |  11588MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

mingchen62 avatar Jan 08 '18 16:01 mingchen62

Hi @mingchen62, It seems like the input image was really huge (after convolutions and flattening you get shape 34560, which I assume is batched, so you just need to divide by the batch size). Did you use the Harvard dataset? There might be some problem here (check the shapes?). Otherwise, I remember having some OOM issues with a broken install of TensorFlow due to undesired Ubuntu updates. Depending on the GPU used, it could happen that you have to lower the batch size... but 48 GB seems more than enough. Keep me updated if you find the fix / origin of your problem! Cheers, Guillaume
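For context, a rough back-of-envelope sketch of how that flattened attention input scales with image size and batch size. The image size, CNN stride, and feature dimension below are assumed values for illustration, not the repo's actual defaults:

```python
# Back-of-envelope estimate of the tensor feeding the attention MatMul.
# All numbers below are assumptions for illustration only.
batch_size = 20
height, width = 160, 400       # preprocessed image size in pixels (assumed)
downsample = 8                 # total stride of the CNN encoder (assumed)
feat_dim = 512                 # encoder feature dimension (assumed)

positions = (height // downsample) * (width // downsample)
rows = batch_size * positions  # first dim of the flattened attention input
bytes_needed = rows * feat_dim * 4  # float32

print("flattened shape: (%d, %d), ~%.1f MB" % (rows, feat_dim, bytes_needed / 1e6))
```

With numbers like these the tensor itself is modest, but the attention scores, softmax, and their gradients all scale the same way, so a few unusually large images in a batch can push memory over the limit even on a 12 GB card.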

guillaumegenthial avatar Jan 10 '18 00:01 guillaumegenthial

Hi Guillaume, thanks for the great blog and GitHub repo. I tried a smaller batch size (4) and was able to run training. (I do agree that 48 GB of memory should be enough for a decent batch size of 20.)

I used the Harvard dataset (make build).

Will dig more into this and report back if I find out anything.

mingchen62 avatar Jan 12 '18 02:01 mingchen62

Wonder if it has to do with the TF version. @guillaumegenthial, are you using a different TF version than 1.4.0-rc0?

Saw this warning:

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

mingchen62 avatar Jan 12 '18 16:01 mingchen62

I'm using tensorflow==1.4.1. This warning is expected and shouldn't be a problem. Have you found anything weird about shapes?
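One quick way to look for weird shapes is to scan the training images and report the largest ones. A minimal sketch; the directory path and the use of Pillow are assumptions, so adapt them to your data layout:

```python
import glob
import os

from PIL import Image  # assumed image library; any reader that gives sizes works

IMG_DIR = "data/images_train"  # hypothetical path, adjust to your setup

sizes = []
for path in glob.glob(os.path.join(IMG_DIR, "*.png")):
    img = Image.open(path)
    w, h = img.size
    img.close()
    sizes.append((w * h, w, h, path))

# Print the ten images with the largest area; outliers here are the usual
# suspects for the attention-MatMul OOM.
for area, w, h, path in sorted(sizes, reverse=True)[:10]:
    print("%dx%d  %s" % (w, h, path))
```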

guillaumegenthial avatar Jan 16 '18 05:01 guillaumegenthial

I solved this problem by deleting all images whose size > 400*160 (about 250 images) from the training set.
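A minimal sketch of that kind of filtering, assuming the training images live in one directory of PNGs and can simply be deleted; the 400*160 threshold comes from the comment above, while the path and library are assumptions:

```python
import glob
import os

from PIL import Image  # assumed image library

IMG_DIR = "data/images_train"   # hypothetical path, adjust to your layout
MAX_AREA = 400 * 160            # threshold from the comment above

removed = 0
for path in glob.glob(os.path.join(IMG_DIR, "*.png")):
    img = Image.open(path)
    w, h = img.size
    img.close()
    if w * h > MAX_AREA:
        os.remove(path)         # or move to a quarantine folder instead
        removed += 1

print("removed %d oversized images" % removed)
```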

luo3300612 avatar Jan 17 '19 14:01 luo3300612

Thanks @luo3300612, I met the same OOM problem, which happened at the same node "...rnn/att_mechanism/MatMul/...", and I solved it by following your advice. I modified the method _process_instance in model/utils/data_generator.py:

.....
img = imread(self._dir_images + "/" + img_path)  # read the image with scipy's imread
img_shape = np.shape(img)
area = img_shape[0] * img_shape[1]
max_area = 400 * 160
img = self._img_prepro(img)
....
if area > max_area:
    skip = True

return inst, skip

tnkong avatar May 13 '19 07:05 tnkong