pytorch_RFCN

GPU memory usage increases during training.

Open ghost opened this issue 6 years ago • 14 comments

When I trained the network, GPU memory usage increased linearly. Has anyone else encountered this problem? I use Python 3.6 and PyTorch 0.4.0 in my experiments.

ghost avatar Jul 09 '18 01:07 ghost

Yeah, I also found that memory usage increases during training. I set BATCH_SIZE in cfgs/res101.yml to 1, and after 3 epochs memory usage was over 22G. I use a Tesla P40, which has only 22912MB of memory.

zorrocai avatar Jul 31 '18 14:07 zorrocai

I have the same problem. Do you have a solution? After 500-600 iterations I can't continue training because it runs out of memory.

GYxiaOH avatar Aug 20 '18 04:08 GYxiaOH

I think it may be caused by the absence of crop ROI pooling, but I haven't tested it.

zorrocai avatar Aug 20 '18 04:08 zorrocai

Crop ROI pooling? I don't understand what you mean. Can you explain in detail? I use psroi in my net, and I compared the psroi code with the Caffe code (https://github.com/daijifeng001/caffe-rfcn/blob/4bcfcd104bb0b9f0862e127c71bd845ddf036f14/src/caffe/layers/psroi_pooling_layer.cu), but I couldn't find the cause.

GYxiaOH avatar Aug 20 '18 06:08 GYxiaOH

The original Faster R-CNN applies ROI pooling to the predicted ROIs; if you choose the crop pooling mode, the ROIs are resized. I think this resize operation may help cut down memory usage. However, I haven't verified this. If you have time, you could try this resize operation in psroi pooling, and please keep me posted on your progress.

zorrocai avatar Aug 20 '18 06:08 zorrocai

Hmm, I see, but isn't the size of the ROIs already the same, such as 7*7? And even if resizing helps cut down memory usage, it can't explain why memory increases gradually during training.

GYxiaOH avatar Aug 20 '18 07:08 GYxiaOH

It is very complex.

zorrocai avatar Aug 20 '18 07:08 zorrocai

I found that I hit the same problem when I use ROI pooling, so I guess it may be a version problem?

GYxiaOH avatar Aug 20 '18 09:08 GYxiaOH

Faster R-CNN doesn't have this problem, so I think the bug lies in the differences between Faster R-CNN and RFCN, not in the version.

zorrocai avatar Aug 21 '18 01:08 zorrocai

@zorrocai It runs correctly if you use PyTorch 0.3.1, so I guess I was right, but I don't know why.

GYxiaOH avatar Aug 23 '18 11:08 GYxiaOH

@GYxiaOH Thanks for the notice.

zorrocai avatar Aug 23 '18 11:08 zorrocai

@zorrocai @GYxiaOH @k123v Hi, have you found out why ps-roi pooling increases memory usage when using PyTorch 0.4.0?

lxtGH avatar Aug 30 '18 17:08 lxtGH

@lxtGH Sorry, I have been busy with another project these days. I haven't located the problem.

zorrocai avatar Sep 04 '18 12:09 zorrocai

The issue lies in the variables saved for backward in the psroipooling layer. Between PyTorch 0.3 and 0.4, the functionality changed: tensors now have to be saved with the save_for_backward() function instead of being stored directly on ctx, so that they can be properly cleaned up.

dzhang97 avatar Feb 15 '19 01:02 dzhang97
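
For anyone landing here later, a minimal sketch of the pattern described in the last comment may help. This is not the repository's psroipooling code: `ScaledSquare` is a hypothetical toy layer, used only to show tensors going through `ctx.save_for_backward()` and being read back from `ctx.saved_tensors`, while a non-tensor value is kept as a plain attribute on `ctx`. It assumes a PyTorch version with static-method `autograd.Function` (0.4 or later).

```python
import torch
from torch.autograd import Function


class ScaledSquare(Function):
    """Hypothetical toy layer standing in for a custom op such as PSRoI pooling."""

    @staticmethod
    def forward(ctx, x, scale):
        # Tensors needed in backward are registered with save_for_backward
        # rather than assigned directly (e.g. ctx.x = x), which is the
        # pattern the comment above identifies as the source of the growth.
        ctx.save_for_backward(x)
        # Non-tensor values (here a Python float) can live on ctx as attributes.
        ctx.scale = scale
        return scale * x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Retrieve the tensors registered in forward.
        (x,) = ctx.saved_tensors
        # d/dx (scale * x^2) = 2 * scale * x; scale is not a tensor, so its
        # gradient slot is None.
        return grad_output * 2.0 * ctx.scale * x, None


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = ScaledSquare.apply(x, 0.5)
y.sum().backward()
grad = x.grad.tolist()
```

In the psroipooling layer the same idea would presumably apply to whichever tensors its forward pass keeps for the backward pass; which tensors those are depends on the actual layer code, which is not shown in this thread.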