pytorch_RFCN
GPU memory usage increases during training.
When I train the network, memory usage increases linearly. Has anyone else encountered this problem? I use Python 3.6 and PyTorch 0.4 in my experiments.
Yeah, I found that memory usage increases during training. I set BATCH_SIZE in cfgs/res101.yml to 1, and after 3 epochs the memory used is over 22 GB. I use a Tesla P40, which has only 22912 MB of memory.
I have the same problem, do you have a solution? After 500-600 iterations I can't continue training because it runs out of memory.
I thought it might be caused by the absence of crop RoI pooling, but I haven't tested it.
Crop RoI pooling? I don't know what you mean, can you explain in detail? I use PS-RoI pooling in my net, and I compared the PS-RoI pooling code with the Caffe code at https://github.com/daijifeng001/caffe-rfcn/blob/4bcfcd104bb0b9f0862e127c71bd845ddf036f14/src/caffe/layers/psroi_pooling_layer.cu, but I can't find the cause.
In the original Faster R-CNN, RoI pooling is applied to the predicted RoIs, and if you choose the crop pooling mode the RoIs are resized. I think this resize operation may help cut down memory usage. However, I haven't verified this hypothesis, so if you have time you could try the resize operation in PS-RoI pooling, and please keep me posted on the progress.
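To make the "resize" idea concrete, here is a minimal sketch of resizing a variable-sized RoI crop of the feature map to a fixed 7x7 grid. This is my own illustration, not the repo's actual crop pooling code; the names `crop_pool`, `features`, and `roi`, and the (x1, y1, x2, y2) box format, are assumptions.

```python
import torch
import torch.nn.functional as F


def crop_pool(features, roi, pooled_size=7):
    """Resize one RoI crop of a feature map to a fixed pooled_size x pooled_size grid.

    features: (1, C, H, W) feature map; roi: (x1, y1, x2, y2) in feature-map
    coordinates. Both names and the box format are illustrative assumptions.
    """
    x1, y1, x2, y2 = [int(v) for v in roi]
    crop = features[:, :, y1:y2 + 1, x1:x2 + 1]        # variable-sized region
    return F.adaptive_max_pool2d(crop, pooled_size)    # fixed 7x7 output


# Example: every RoI ends up as a (1, C, 7, 7) tensor regardless of its original size.
feat = torch.randn(1, 256, 38, 50)
pooled = crop_pool(feat, (4, 3, 20, 17))
```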
Hmm, I see, but aren't the RoIs already pooled to the same size, such as 7*7? And even if resizing can help cut down memory usage, it doesn't explain why memory increases gradually during training.
It is very complex.
I found that I hit the same problem when I use plain RoI pooling, so I guess it may be a problem with the PyTorch version?
Faster R-CNN doesn't have this problem, so I think the bug lies in the differences between Faster R-CNN and R-FCN, not in the version.
@zorrocai It runs correctly if you use PyTorch 0.3.1, so I guess I was right, but I don't know why.
@GYxiaOH Thanks for the notice.
@zorrocai @GYxiaOH @k123v Hi, have you found out why PS-RoI pooling increases memory usage when using PyTorch 0.4?
@lxtGH Sorry, I've been busy with another project these days. I haven't located the problem.
The issue lies in the variables saved for backward in the PS-RoI pooling layer. Between PyTorch 0.3 and 0.4, the autograd Function API changed to require using save_for_backward() instead of saving tensors directly on ctx, so that the saved variables can be properly cleaned up.
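For anyone hitting this, here is a minimal, runnable sketch of the pattern described above, using a toy elementwise op in place of the repo's PS-RoI pooling CUDA kernel (the names `ElementwiseScale`, `x`, and `weight` are illustrative and not from the repo). The point is the contrast between stashing input tensors as ctx attributes, which keeps them alive across iterations under PyTorch 0.4, and handing them to ctx.save_for_backward(), which lets autograd free them after backward.

```python
import torch
from torch.autograd import Function


class ElementwiseScale(Function):
    """Toy op standing in for the PS-RoI pooling kernel; only the saving pattern matters."""

    @staticmethod
    def forward(ctx, x, weight):
        # PyTorch 0.3-era code often stashed the inputs as plain attributes, e.g.
        #     ctx.x, ctx.weight = x, weight
        # Under 0.4 that keeps the tensors (and the graph they hang off) alive, so
        # GPU memory grows every iteration. save_for_backward() registers them with
        # autograd, which releases them once backward has run.
        ctx.save_for_backward(x, weight)
        return x * weight

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        return grad_output * weight, grad_output * x


# Quick check: memory stays flat across iterations with save_for_backward().
x = torch.randn(8, 16, requires_grad=True)
w = torch.randn(8, 16, requires_grad=True)
for _ in range(3):
    ElementwiseScale.apply(x, w).sum().backward()
```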