im2latex
OOM during training
I got an OOM error during training.
Environment: TensorFlow 1.4, GPU: Titan X, Python 2.7, Ubuntu 16.04
Error:
2018-01-07 22:12:42.933166: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[34560,1]
Traceback (most recent call last):
File "train.py", line 61, in
Caused by op u'attn_cell/rnn/while/rnn/att_mechanism/MatMul', defined at:
File "train.py", line 61, in
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
To add more info:
The output of "nvidia-smi" at the time of the OOM: all ~48 GB of GPU memory (4 × 12 GB) were in use.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:05:00.0  On |                  N/A |
| 45%   73C    P2    66W / 250W | 11762MiB / 12188MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:06:00.0 Off |                  N/A |
| 23%   33C    P8    16W / 250W | 11588MiB / 12189MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   31C    P8    16W / 250W | 11588MiB / 12189MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   20C    P8    15W / 250W | 11588MiB / 12189MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
Hi @mingchen62,
It seems like the input image was really huge: after the convolutions and flattening you get shape 34560, which I assume includes the batch dimension, so divide by the batch size to get the number of attention positions per image. Did you use the Harvard dataset? There might be some problem there (check the shapes?). Otherwise, I remember having an OOM issue once with a broken TensorFlow install caused by unwanted Ubuntu updates. Depending on the GPU, you may have to lower the batch size... but 48 GB seems more than enough. Keep me updated if you find the fix / origin of your problem!
Cheers, Guillaume
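For reference, a quick back-of-the-envelope check of where a number like 34560 could come from. This is only a sketch: the 8× downsampling factor is an assumption about the CNN encoder (three stride-2 poolings), and the example image sizes are illustrative, not taken from the dataset.

```python
def attn_rows(img_h, img_w, batch_size, downsample=8):
    """Rows in the attention MatMul input:
    batch_size * (feature-map height * width) after the CNN encoder."""
    fh, fw = img_h // downsample, img_w // downsample
    return batch_size * fh * fw

# A 400x160 image at batch size 20 with 8x downsampling:
print(attn_rows(160, 400, 20))  # 20 * 20 * 50 = 20000 rows
# 34560 rows would correspond to noticeably larger images, e.g.:
print(attn_rows(192, 576, 20))  # 20 * 24 * 72 = 34560 rows
```

The memory cost of the attention MatMul grows linearly with image area, which is why a few oversized images in a batch can push an otherwise-fine configuration over the limit.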
Hi Guillaume, thanks for the great blog and GitHub repo. I tried a smaller batch size (4) and was able to run training (I agree that 48 GB should be enough for a decent batch size of 20).
I use the Harvard dataset (make build).
I will dig more into this and report back if I find anything.
I wonder if it has to do with the TF version. @guillaumegenthial, are you using a different tf version than 1.4.0-rc0?
I saw this warning:
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
I'm using tensorflow==1.4.1. This warning is expected and shouldn't be a problem. Have you found anything weird about shapes?
I solved this problem by deleting all images larger than 400×160 pixels (about 250 images) from the training set.
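A minimal sketch of that filtering step, assuming PIL is installed and the training images sit in one flat directory (the function name and directory layout are illustrative, not from the repo):

```python
import os
from PIL import Image

def oversized_images(img_dir, max_area=400 * 160):
    """Return filenames whose pixel area exceeds max_area."""
    bad = []
    for name in sorted(os.listdir(img_dir)):
        with Image.open(os.path.join(img_dir, name)) as img:
            w, h = img.size  # PIL reports (width, height)
            if w * h > max_area:
                bad.append(name)
    return bad
```

Running this once over the training images and removing (or skipping) the returned files bounds the attention tensor size per example, which is what resolves the OOM here.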
Thanks @luo3300612, I met the same OOM problem, which happened at the same node (...rnn/att_mechanism/MatMul...), and I solved it with your advice. I modified the method _process_instance in model/utils/data_generator.py:

    ...
    img = imread(self._dir_images + "/" + img_path)  # read the image with scipy's imread
    img_shape = np.shape(img)
    area = img_shape[0] * img_shape[1]
    max_area = 400 * 160
    img = self._img_prepro(img)
    ...
    if area > max_area:
        skip = True
    return inst, skip