R-DFPN_FPN_Tensorflow

out of memory

Open · DL-ljw opened this issue 6 years ago • 6 comments

Sorry for bothering you again. When I train with a single 1080 GPU and a batch size of 1, I get the following errors. How can I solve this?

2018-05-10 13:42:49: step247692 image_name:000624.jpg | rpn_loc_loss:0.189756244421 | rpn_cla_loss:0.214562356472 | rpn_total_loss:0.404318600893 | fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.00815858319402 | fast_rcnn_total_loss:0.00815858319402 | total_loss:1.17546725273 | per_cost_time:0.65540599823s
out of memory
invalid argument
2018-05-10 13:42:53.349625: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2018-05-10 13:42:53.349637: I tensorflow/stream_executor/stream.cc:4138] stream 0x55cd063dc880 did not memcpy host-to-device; source: 0x7fa30b0da010
2018-05-10 13:42:53.349641: E tensorflow/stream_executor/stream.cc:289] Error recording event in stream: error recording CUDA event on stream 0x55cd063dc950: CUDA_ERROR_ILLEGAL_ADDRESS; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2018-05-10 13:42:53.349647: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2018-05-10 13:42:53.349650: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
an illegal memory access was encountered
an illegal memory access was encountered

DL-ljw · May 12 '18 02:05

Same problem. The error occurs after 5000 steps.

powermano · May 20 '18 08:05

What is your cuDNN version?

powermano · May 20 '18 08:05

CUDA 8.0, cuDNN 5.0

DL-ljw · Jul 16 '18 08:07

I have met the same problem. Have you solved it, and if so, how? Thanks.

liqi-lizezhong · Jan 14 '19 08:01

I found that reducing the number of anchors can somewhat alleviate this problem. You can remove some of the angles or ratios in R-DFPN_FPN_Tensorflow/libs/configs/cfgs.py:

ANCHOR_ANGLES = [-90, -75, -60, -45, -30, -15]
ANCHOR_RATIOS = [1/5., 5., 1/7., 7., 1/9., 9.]  # note the trailing dots: in Python 2, 1/9 without a dot is integer division and evaluates to 0

I encountered the CUDA_ERROR_ILLEGAL_ADDRESS error during training when the objects are densely located, so limiting the objects in your own dataset (removing some of the actually existing objects) can also alleviate this problem. It works, but not all the time.
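For reference, a minimal sketch of what a reduced setting in libs/configs/cfgs.py might look like (which angles and ratios to keep is my assumption; tune them to your own dataset):

# libs/configs/cfgs.py -- hypothetical reduced anchor settings.
# Assuming anchors are generated one per (scale, ratio, angle) combination,
# the anchor count per location scales with len(ANCHOR_RATIOS) * len(ANCHOR_ANGLES),
# so trimming either list directly cuts GPU memory use.
ANCHOR_ANGLES = [-90, -60, -30]         # 3 angles instead of the original 6
ANCHOR_RATIOS = [1/5., 5., 1/7., 7.]    # 4 ratios instead of the original 6
# Before: 6 * 6 = 36 anchors per scale per location; after: 3 * 4 = 12.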

powermano · Jan 14 '19 08:01

Thanks a lot! It works.

clw5180 · Aug 05 '19 16:08