Out of memory error !!

hoticevijay opened this issue 8 years ago · 22 comments

I am trying to train on the CamSeq dataset using SegNet. I was able to prepare the dataset, but when I run training I get an out-of-memory error. I am using an NVIDIA GTX 960 with 4 GB. I have even reduced the batch size to 1 in both the train and test prototxts, but I'm still getting the error

check failed: error == cudaSuccess (2 vs. 0) out of memory.

hoticevijay avatar Feb 10 '16 06:02 hoticevijay

Have you followed the tutorial? I think you should be able to train with 4 GB of memory for both SegNet and SegNet-Basic: http://mi.eng.cam.ac.uk/projects/segnet/tutorial.html

alexgkendall avatar Feb 10 '16 09:02 alexgkendall

@hoticevijay Try to down-sample the input ...
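
For example, here is a minimal sketch of one way to down-sample a CamVid-style dataset before training. The train/ and trainannot/ folder names, the *.png pattern and the 240x180 target size are assumptions; adjust them to your own layout:

import glob
import os

import cv2

# Assumed layout: RGB images in train/, integer label maps in trainannot/.
# Labels must be resized with nearest-neighbour interpolation so class IDs
# are not blended together.
for folder, interp in [("train", cv2.INTER_AREA), ("trainannot", cv2.INTER_NEAREST)]:
    out_dir = folder + "_small"
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(folder, "*.png")):
        img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
        small = cv2.resize(img, (240, 180), interpolation=interp)  # (width, height)
        cv2.imwrite(os.path.join(out_dir, os.path.basename(path)), small)

Note that after down-sampling, the upsample_h/upsample_w values in the train prototxt have to be adjusted to the new size (see further down this thread).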

MahmoudElkhateeb avatar Feb 12 '16 20:02 MahmoudElkhateeb

I'm trying to run ./caffe-segnet/build/tools/caffe train -gpu 0 -solver SegNet-Tutorial/Models/segnet_solver.prototxt on GeForce GTX 460.

I reduced batch_size to 1.

It seems to be the same error. Is it because I'm out of memory?

I0225 23:41:36.637146 11302 net.cpp:247] Network initialization done.
I0225 23:41:36.637159 11302 net.cpp:248] Memory required for data: 1083110452
I0225 23:41:36.637718 11302 solver.cpp:42] Solver scaffolding done.
I0225 23:41:36.637940 11302 solver.cpp:250] Solving VGG_ILSVRC_16_layer
I0225 23:41:36.637953 11302 solver.cpp:251] Learning Rate Policy: step
F0225 23:41:36.745960 11302 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f3281712daa  (unknown)
    @     0x7f3281712ce4  (unknown)
    @     0x7f32817126e6  (unknown)
    @     0x7f3281715687  (unknown)
    @     0x7f3281b827db  caffe::SyncedMemory::mutable_gpu_data()
    @     0x7f3281b6a542  caffe::Blob<>::mutable_gpu_data()
    @     0x7f3281b91364  caffe::ConvolutionLayer<>::Forward_gpu()
    @     0x7f3281a72279  caffe::Net<>::ForwardFromTo()
    @     0x7f3281a726a7  caffe::Net<>::ForwardPrefilled()
    @     0x7f3281b45a55  caffe::Solver<>::Step()
    @     0x7f3281b4638f  caffe::Solver<>::Solve()
    @           0x406676  train()
    @           0x404bb1  main
    @     0x7f3280c24ec5  (unknown)
    @           0x40515d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

mrgloom avatar Feb 25 '16 20:02 mrgloom

It's strange, but it fails even in CPU mode:


I0225 23:58:24.054597 12022 net.cpp:247] Network initialization done.
I0225 23:58:24.054610 12022 net.cpp:248] Memory required for data: 1083110452
I0225 23:58:24.055112 12022 solver.cpp:42] Solver scaffolding done.
I0225 23:58:24.055332 12022 solver.cpp:250] Solving VGG_ILSVRC_16_layer
I0225 23:58:24.055346 12022 solver.cpp:251] Learning Rate Policy: step
F0225 23:59:00.052022 12022 upsample_layer.cpp:127] upsample top index 0 out of range - check scale settings match input pooling layer's downsample setup
*** Check failure stack trace: ***
    @     0x7f7a1156cdaa  (unknown)
    @     0x7f7a1156cce4  (unknown)
    @     0x7f7a1156c6e6  (unknown)
    @     0x7f7a1156f687  (unknown)
    @     0x7f7a11984169  caffe::UpsampleLayer<>::Backward_cpu()
    @     0x7f7a118cb67d  caffe::Net<>::BackwardFromTo()
    @     0x7f7a118cb821  caffe::Net<>::Backward()
    @     0x7f7a1199fa5d  caffe::Solver<>::Step()
    @     0x7f7a119a038f  caffe::Solver<>::Solve()
    @           0x406676  train()
    @           0x404bb1  main
    @     0x7f7a10a7eec5  (unknown)
    @           0x40515d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

mrgloom avatar Feb 25 '16 21:02 mrgloom

Have you changed anything from the tutorial? Also are you able to test SegNet?

alexgkendall avatar Feb 25 '16 21:02 alexgkendall

I changed only the batch size to reduce memory usage, as the tutorial suggests.

Where can I find an already trained model to test SegNet?

mrgloom avatar Feb 27 '16 07:02 mrgloom

I also tried to downsample the images to 240x180, but it seems it's not that easy.

I0227 11:32:34.148586 13517 net.cpp:90] Creating Layer upsample4
I0227 11:32:34.148597 13517 net.cpp:410] upsample4 <- conv5_1_D
I0227 11:32:34.148608 13517 net.cpp:410] upsample4 <- pool4_mask
I0227 11:32:34.148624 13517 net.cpp:368] upsample4 -> pool4_D
I0227 11:32:34.148638 13517 net.cpp:120] Setting up upsample4
F0227 11:32:34.148663 13517 upsample_layer.cpp:63] Check failed: bottom[0]->height() == bottom[1]->height() (23 vs. 12) 
*** Check failure stack trace: ***
    @     0x7f6774aeadaa  (unknown)
    @     0x7f6774aeace4  (unknown)
    @     0x7f6774aea6e6  (unknown)
    @     0x7f6774aed687  (unknown)
    @     0x7f6774f01a88  caffe::UpsampleLayer<>::Reshape()
    @     0x7f6774e54502  caffe::Net<>::Init()
    @     0x7f6774e56262  caffe::Net<>::Net()
    @     0x7f6774f19f00  caffe::Solver<>::InitTrainNet()
    @     0x7f6774f1aed3  caffe::Solver<>::Init()
    @     0x7f6774f1b0a6  caffe::Solver<>::Solver()
    @           0x40c5d0  caffe::GetSolver<>()
    @           0x406611  train()
    @           0x404bb1  main
    @     0x7f6773ffcec5  (unknown)
    @           0x40515d  (unknown)
    @              (nil)  (unknown)

It seems to be related to https://github.com/alexgkendall/caffe-segnet/issues/10

mrgloom avatar Feb 27 '16 08:02 mrgloom

@mrgloom

You will need to change upsample_h in certain layers. The error above suggests that upsample_h at the upsample4 layer should be 23:

layer {
  name: "upsample4"
  type: "Upsample"
  bottom: "pool4"
  bottom: "pool4_mask"
  top: "upsample4"
  upsample_param {
    upsample_h: 23
  }
}

MahmoudElkhateeb avatar Feb 28 '16 10:02 MahmoudElkhateeb

I tried to change

layer {
  name: "upsample4"
  type: "Upsample"
  bottom: "conv5_1_D"
  top: "pool4_D"
  bottom: "pool4_mask"
  upsample_param {
    scale: 2
    upsample_w: 30#60 #depends on image input size
    upsample_h: 23#45
  }
}

but I still get the same error (always 23 vs. 12):

F0303 00:56:11.244488  3618 upsample_layer.cpp:63] Check failed: bottom[0]->height() == bottom[1]->height() (23 vs. 12) 

Also, my question is: why do some upsample layers have upsample_w and upsample_h while others just have scale: 2?

mrgloom avatar Mar 02 '16 21:03 mrgloom

It seems I've understood that I need to modify all the upsample layers and specify upsample_w and upsample_h directly.
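
The underlying reason seems to be that Caffe's pooling rounds its output size up, so a 2x2, stride-2 pool halves a dimension exactly only when it is even; for odd sizes, upsampling with scale: 2 overshoots by one pixel, which is why those decoder layers need explicit upsample_h/upsample_w. A minimal sketch (assuming the standard five-stage SegNet encoder) of the sizes the decoder has to match for a 180x240 input:

def encoder_sizes(h, w, stages=5):
    # (h, w) after each 2x2, stride-2 pooling stage; Caffe rounds the output size up.
    sizes = []
    for _ in range(stages):
        h, w = (h + 1) // 2, (w + 1) // 2
        sizes.append((h, w))
    return sizes

print(encoder_sizes(180, 240))
# -> [(90, 120), (45, 60), (23, 30), (12, 15), (6, 8)]
# The decoder must walk back up through exactly these sizes:
# upsample5 -> (12, 15), upsample4 -> (23, 30), upsample3 -> (45, 60),
# upsample2 -> (90, 120), upsample1 -> (180, 240).

For this input, upsample5, upsample4 and upsample3 need explicit sizes (8 -> 15, 12 -> 23 and 23 -> 45 are not exact doublings), while upsample2 and upsample1 are exact doublings and can keep scale: 2.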

Here is my segnet_solver.prototxt: https://gist.github.com/mrgloom/f0972272938adfc44163

./caffe-segnet/build/tools/caffe train -gpu 0 -solver SegNet-Tutorial/Models/segnet_solver.prototxt

But even with the reduced image size it takes too much memory: http://pastebin.com/dkJTgXQu

Also, when I try to train my model on CPU with ./caffe-segnet/build/tools/caffe train -solver SegNet-Tutorial/Models/segnet_solver.prototxt (http://pastebin.com/ZAANMbfp) I get the same error:

I0303 01:19:28.473911  3856 net.cpp:247] Network initialization done.
I0303 01:19:28.473920  3856 net.cpp:248] Memory required for data: 271680820
I0303 01:19:28.474397  3856 solver.cpp:42] Solver scaffolding done.
I0303 01:19:28.474613  3856 solver.cpp:250] Solving VGG_ILSVRC_16_layer
I0303 01:19:28.474623  3856 solver.cpp:251] Learning Rate Policy: step
F0303 01:19:34.556548  3856 upsample_layer.cpp:127] upsample top index 0 out of range - check scale settings match input pooling layer's downsample setup

Seems to be the same as here: https://github.com/alexgkendall/caffe-segnet/issues/5

mrgloom avatar Mar 02 '16 22:03 mrgloom

At least I successfully ran the pretrained model segnet_basic_camvid.caffemodel from the model zoo, but only in CPU mode.

What I don't understand is why it consumes about ~5000 MB of RAM, while in the log Caffe says Memory required for data: 410930228, which as I understand is about ~400 MB.

export PATH=$PATH:/home/myuser/Downloads/SegNet/caffe-segnet/build/tools
export PYTHONPATH=/home/myuser/Downloads/SegNet/caffe-segnet/python:$PYTHONPATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

python /home/myuser/Downloads/SegNet/SegNet-Tutorial/Scripts/compute_bn_statistics.py /home/myuser/Downloads/SegNet/SegNet-Tutorial/Models/segnet_basic_train.prototxt /home/myuser/Downloads/SegNet/SegNet-Tutorial/Models/Training/segnet_basic_camvid.caffemodel /home/myuser/Downloads/SegNet/Models/Inference/ 
batch_size: 1
Memory required for data: 410930228
htop RES column ~4230Mb

batch_size: 2
Memory required for data: 821856308
htop RES column ~4683Mb

batch_size: 4
Memory required for data: 1643708468
htop RES column ~5598Mb

batch_size: 8
Memory required for data: 3287412788
htop RES column ~6597Mb (after 1st iteration 7261Mb) 
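
As far as I understand, the logged "Memory required for data" counts only the network's top (activation) blobs; parameter blobs, all the diff buffers, the solver's history and per-layer im2col buffers come on top of that, which is why resident memory is several times larger. A rough pycaffe sketch to compare the two figures (the prototxt path is a placeholder):

import caffe

caffe.set_mode_cpu()
# Placeholder path; point this at your own train prototxt.
net = caffe.Net('segnet_basic_train.prototxt', caffe.TEST)

param_bytes = sum(b.data.nbytes for blobs in net.params.values() for b in blobs)
data_bytes = sum(b.data.nbytes for b in net.blobs.values())

print('activation (data) blobs: %.0f MB' % (data_bytes / 1e6))
print('parameter blobs:         %.0f MB' % (param_bytes / 1e6))
# During training, each of these also gets a diff buffer of the same size,
# and the solver keeps extra history blobs for every parameter blob.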

mrgloom avatar Mar 13 '16 20:03 mrgloom

Hi,

It is no secret that Caffe is generally not efficient in memory usage; that is the price paid for speed. cuDNN could slightly improve the situation.

Best

arassadin avatar Apr 28 '16 15:04 arassadin

Can you elaborate on "Caffe generally not efficient in memory usage"? Is it related to the convolution implementation (http://caffe.berkeleyvision.org/tutorial/convolution.html)?

mrgloom avatar Apr 29 '16 08:04 mrgloom

@mrgloom:

is it related to convolution implementation

It seems so, yes.
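
Roughly speaking, Caffe lowers each convolution onto a matrix multiply via im2col, which materialises every kxk patch of the input as a column, so a single 3x3 convolution temporarily needs about 9x the memory of its input feature map. A minimal NumPy sketch of the idea (stride 1, 'same' padding; the sizes are chosen only for illustration):

import numpy as np

def im2col(x, k):
    # Unroll every kxk patch of a (C, H, W) map into a column (stride 1, 'same' padding).
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode='constant')
    cols = np.empty((c * k * k, h * w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(k):
            for j in range(k):
                cols[row] = xp[ci, i:i + h, j:j + w].ravel()
                row += 1
    return cols

x = np.zeros((64, 180, 240), dtype=np.float32)  # one 64-channel feature map at 180x240
cols = im2col(x, 3)
print(x.nbytes / 1e6, cols.nbytes / 1e6)  # ~11 MB input vs ~100 MB im2col buffer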

arassadin avatar Apr 30 '16 18:04 arassadin

Hi @mrgloom, I also get the Check failed: bottom[0]->height() == bottom[1]->height() (23 vs. 16) error, and I used your suggested segnet_solver.prototxt (https://gist.github.com/mrgloom/f0972272938adfc44163), but the same error still occurs. Any help is appreciated. Thanks!

sepidehhosseinzadeh avatar Sep 10 '16 00:09 sepidehhosseinzadeh

I got the exact same error:

I0225 23:41:36.637146 11302 net.cpp:247] Network initialization done.
I0225 23:41:36.637159 11302 net.cpp:248] Memory required for data: 1083110452
I0225 23:41:36.637718 11302 solver.cpp:42] Solver scaffolding done.
I0225 23:41:36.637940 11302 solver.cpp:250] Solving VGG_ILSVRC_16_layer
I0225 23:41:36.637953 11302 solver.cpp:251] Learning Rate Policy: step
F0225 23:41:36.745960 11302 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f3281712daa  (unknown)
    @     0x7f3281712ce4  (unknown)
    @     0x7f32817126e6  (unknown)
    @     0x7f3281715687  (unknown)
    @     0x7f3281b827db  caffe::SyncedMemory::mutable_gpu_data()
    @     0x7f3281b6a542  caffe::Blob<>::mutable_gpu_data()
    @     0x7f3281b91364  caffe::ConvolutionLayer<>::Forward_gpu()
    @     0x7f3281a72279  caffe::Net<>::ForwardFromTo()
    @     0x7f3281a726a7  caffe::Net<>::ForwardPrefilled()
    @     0x7f3281b45a55  caffe::Solver<>::Step()
    @     0x7f3281b4638f  caffe::Solver<>::Solve()
    @           0x406676  train()
    @           0x404bb1  main
    @     0x7f3280c24ec5  (unknown)
    @           0x40515d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

The solution was to install cuDNN v2 and enable it in Makefile.config (USE_CUDNN := 1).

Running on a GTX 980 Ti (6 GB), Ubuntu 14.04, and CUDA 7.5.

Seanberite avatar Sep 29 '16 10:09 Seanberite

The solution from @Seanberite works. I was running CUDA 8 without cuDNN, which gave the out-of-memory issue. I downgraded to CUDA 7.5 and cuDNN v2, and I can now train the CamVid demo without the memory error on a GTX 960 (4 GB, batch size 1).

khufkens avatar Jan 29 '17 21:01 khufkens

@mrgloom How did you resolve "upsample top index 0 out of range - check scale settings match input pooling layer's downsample setup"? Thanks

tranvanhoa533 avatar Mar 06 '17 03:03 tranvanhoa533

Hi, in the beginning I got the exact same out-of-memory error reported here, so I followed the advice I found here and rebuilt caffe-segnet with cuDNN enabled, but now I get this error:

Setting up conv1_1
F0314 06:59:55.986902 10381 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR

Any idea?

P.S.: I tried with cuDNN v5 and CUDA 8.0 (I needed to use cuDNN-related pieces from a newer Caffe repository). I downgraded cuDNN to v2 and got this:

I0314 08:21:15.397732 19896 net.cpp:247] Network initialization done.
I0314 08:21:15.397734 19896 net.cpp:248] Memory required for data: 555925776
I0314 08:21:15.397902 19896 solver.cpp:42] Solver scaffolding done.
I0314 08:21:15.398056 19896 solver.cpp:250] Solving VGG_ILSVRC_16_layer
I0314 08:21:15.398059 19896 solver.cpp:251] Learning Rate Policy: step
F0314 08:21:16.202524 19896 math_functions.cu:123] Check failed: status == CUBLAS_STATUS_SUCCESS (11 vs. 0) CUBLAS_STATUS_MAPPING_ERROR

I'm still on CUDA 8.0, though.

Jorisfournel avatar Mar 14 '17 06:03 Jorisfournel

Is there an actual fix for this issue? I'm using Ubuntu 16.04 / NVIDIA GTX 1050 / CUDA 8.0 / no cuDNN.

Whiax avatar Sep 28 '17 19:09 Whiax

Hi all, I need help! I get the same error with an NVIDIA GTX 960M with 4 GB of memory. I use CUDA 8.0 and batch_size = 1. When I run webcam_demo.py with the GPU, I get this error:

I1014 19:09:01.359390 5819 net.cpp:247] Network initialization done.
I1014 19:09:01.359392 5819 net.cpp:248] Memory required for data: 1065139200
(' Grabbed camera frame in ', '549.619197845', 'ms')
(' Resized image in ', '66.5330886841', 'ms')
F1014 19:09:03.429714 5819 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

Could you help me, please?

abderrazakiazzi avatar Oct 15 '17 18:10 abderrazakiazzi

I would recommend using Ubuntu 14.04 as a fix if anyone faces this issue (tell me if it works better). Otherwise, you can use ENet for segmentation.

Whiax avatar Dec 29 '17 10:12 Whiax