"OOM when allocating tensor" During Test

Open hydk0420 opened this issue 6 years ago • 8 comments

I get the following error when I run 'train.sh' or 'test.sh': ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[65536,8192]

During training, I solved this problem by changing 'config.gpu_options.per_process_gpu_memory_fraction = 0.7' from 0.7 to 0.9. However, the error appears again during testing, and I don't know how to solve it. Can you give me some suggestions?

My configuration: Ubuntu 16.04, TensorFlow 1.3.0, GPU: GTX 1080 Ti (11172 MB of memory)

hydk0420 avatar Jul 09 '18 10:07 hydk0420
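For a rough sense of scale, the sketch below estimates how large the failing tensor is and how much memory each fraction setting allows on the 11172 MB card. It assumes float32 elements (the original post does not state the dtype) and treats the reported "MB" as MiB; the helper names `tensor_mib` and `budget_mib` are made up for illustration. Under those assumptions the single tensor is 2048 MiB, while the 0.7 and 0.9 settings allow roughly 7820 MiB and 10055 MiB for everything the process holds, so the OOM most likely means the rest of the graph has already used up the remaining budget.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: the failing tensor holds float32 values


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in MiB of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element / 2**20


def budget_mib(total_mib, fraction):
    """Upper bound on GPU memory the process may claim for a given
    per_process_gpu_memory_fraction value."""
    return total_mib * fraction


tensor = tensor_mib([65536, 8192])
for fraction in (0.7, 0.9):
    print(f"fraction={fraction}: budget {budget_mib(11172, fraction):.0f} MiB, "
          f"failing tensor {tensor:.0f} MiB")
```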

Maybe you need to change the 'config.gpu_options.per_process_gpu_memory_fraction = 0.45' in line 94 of tools/test_net.py

choasup avatar Jul 09 '18 13:07 choasup

@choasup Sorry to bother you, but I have a problem during training. I set config.gpu_options.per_process_gpu_memory_fraction = 0.95, and two GTX 1080s (8105MiB each) are used during training (I checked with nvidia-smi), but I still hit the OOM problem. Can you give me some suggestions for solving it?

jsxzwyx avatar Jul 24 '18 12:07 jsxzwyx

I'm sorry, but 8105MB of memory is very small... maybe you can set a smaller batch size for the RPN.

choasup avatar Jul 26 '18 02:07 choasup

When you test, you need to change config.gpu_options.per_process_gpu_memory_fraction from 0.45 to 0.7. That solved the problem for me.

joyeuxni avatar Sep 27 '18 01:09 joyeuxni

@choasup Sorry to bother you, but I have a problem during training. I set config.gpu_options.per_process_gpu_memory_fraction = 0.95, and two GTX 1080s (8105MiB each) are used during training (I checked with nvidia-smi), but I still hit the OOM problem. Can you give me some suggestions for solving it?

You can run this SIN code on multiple GPUs, right? Could you tell me what changes need to be made? I also want to run it on multiple GPUs.

MXLHELLO avatar Jan 04 '19 08:01 MXLHELLO

Change "config.gpu_options.per_process_gpu_memory_fraction = 0.7" to "config.gpu_options.allow_growth = True".

hydk0420 avatar Jan 04 '19 09:01 hydk0420
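The two settings discussed in this thread can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper `make_gpu_config` is hypothetical, and the `_tf1` alias is only there so the snippet runs on both TensorFlow 1.x (where `ConfigProto` lives at the top level) and 2.x (where it lives under `tf.compat.v1`).

```python
import tensorflow as tf

# TF 1.x exposes ConfigProto at the top level; TF 2.x keeps it under tf.compat.v1.
_tf1 = tf.compat.v1 if hasattr(tf, "compat") and hasattr(tf.compat, "v1") else tf


def make_gpu_config(memory_fraction=None, allow_growth=False):
    """Build a session config (hypothetical helper, not part of the SIN repo).

    memory_fraction: if given, sets per_process_gpu_memory_fraction.
    allow_growth: if True, GPU memory is claimed on demand instead of up front.
    """
    config = _tf1.ConfigProto()
    if memory_fraction is not None:
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    config.gpu_options.allow_growth = allow_growth
    return config


# The change suggested above: drop the fixed fraction and let memory grow.
config = make_gpu_config(allow_growth=True)
# sess = _tf1.Session(config=config)  # pass the config when creating the session
```

Note that `allow_growth` only changes when memory is claimed; it cannot make a model fit if the graph genuinely needs more memory than the card has.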

I read the SIN paper and found that, in the model in the paper, the ellipses indicate that there are multiple GRU connections. But when I read the code in network.py, I found n_steps = 2. I hit the OOM problem when I change it to 3 or more. Does this mean that more than 2 GRU steps need more memory? I sincerely hope for your reply.

xuyoujian123 avatar May 28 '19 07:05 xuyoujian123

Excuse me, I've been running this model recently and encountered this error during training:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16384,8192]
[[Node: concat_5 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape_9, Reshape_10, concat_5/axis)]]
[[Node: Mean_5/_215 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_8841_Mean_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

How did you fix it? I need your answer urgently, thank you very much. @choasup @hydk0420

veracheug avatar Jun 20 '20 13:06 veracheug
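When comparing OOM reports like the one above, it can help to pull the shape out of the message and convert it to a size. The sketch below does that with a regular expression; `oom_tensor_mib` is a made-up helper name, and the 4 bytes per element follows from the `T=DT_FLOAT` attribute of the `concat_5` node in the log. For shape[16384,8192] this gives 512 MiB for the concat output alone.

```python
import re
from math import prod


def oom_tensor_mib(message, bytes_per_element=4):
    """Return the size in MiB of the tensor named in a TensorFlow OOM message,
    or None if the message does not contain a recognizable shape."""
    match = re.search(r"OOM when allocating tensor with shape\[([\d,\s]+)\]", message)
    if match is None:
        return None
    dims = [int(d) for d in match.group(1).split(",")]
    return prod(dims) * bytes_per_element / 2**20


msg = ("ResourceExhaustedError (see above for traceback): "
       "OOM when allocating tensor with shape[16384,8192]")
print(oom_tensor_mib(msg))
```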