memory error with mkldnn concat layer

Open coolbei opened this issue 6 years ago • 4 comments

Hi, I'm using a really simple network, and this error happens:

I have edited the question so that a small net reproduces the error reliably.

==========================================================================

Update: this small network causes the crash:

input: "data"
input_dim: 1
input_dim: 3
input_dim: 32
input_dim: 32

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 16
    kernel_size: 5
    stride: 1
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fire2/expand1x1"
  type: "Convolution"
  bottom: "conv1"
  top: "fire2/expand1x1"
  convolution_param {
    num_output: 12
    kernel_size: 1
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fire2/expand3x3"
  type: "Convolution"
  bottom: "conv1"
  top: "fire2/expand3x3"
  convolution_param {
    num_output: 12
    pad: 1
    kernel_size: 3
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fire2/concat"
  type: "Concat"
  bottom: "fire2/expand1x1"
  bottom: "fire2/expand3x3"
  top: "fire2/concat"
}

and it causes:

(gdb) bt
#0 0x00007fffeb1428a5 in raise () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#1 0x00007fffeb144085 in abort () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#2 0x00007fffeb17fa37 in __libc_message () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#3 0x00007fffeb185366 in malloc_printerr () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#4 0x00007fffeb187e93 in _int_free () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#5 0x00007fffee5d099d in mkldnn::concat::concat(mkldnn::concat::primitive_desc const&, std::vector<mkldnn::primitive::at, std::allocator<mkldnn::primitive::at> >&, mkldnn::memory const&) () from /home/usertest/Caffe_lib/intel_caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2
#6 0x00007fffee5d623d in caffe::MKLDNNConcatLayer::InitConcatFwd(std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&, std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&) () from /home/usertest/Caffe_lib/intel_caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2
#7 0x00007fffee5d73b6 in caffe::MKLDNNConcatLayer::Forward_cpu(std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&, std::vector<caffe::Blob, std::allocator<caffe::Blob> > const&) () from /home/usertest/Caffe_lib/intel_caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2
#8 0x00007fffee3d3842 in caffe::Net::ForwardFromTo(int, int) () from /home/usertest/Caffe_lib/intel_caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2
#9 0x00007fffee3d3a95 in caffe::Net::Forward(float*) () from /home/usertest/Caffe_lib/intel_caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2
#10 0x0000555555558058 in main ()

coolbei avatar Jan 23 '19 08:01 coolbei

It looks more like an out-of-memory allocation issue. How much memory do you have on these two machines?

ftian1 avatar Jan 23 '19 08:01 ftian1

> It looks more like an out-of-memory allocation issue. How much memory do you have on these two machines?

Thanks for the reply. Both machines have plenty of memory, about 160 GB each. I don't think this small network can use all of it.

coolbei avatar Jan 24 '19 08:01 coolbei

I reproduced your issue on my side and am digging into it. It looks like a memory corruption issue caused by MKLDNN. I will get back to you when I have findings.

ftian1 avatar Jan 30 '19 07:01 ftian1

Sorry for the late response.

I ran your test code on SKX. The error I hit is a little different from yours: it crashes in posix_malloc() with an unreasonable size. After digging into it, the cause is that MKLDNN doesn't support concatenating two nChw16c tensors whose channel counts are not divisible by 16 (here both inputs to the concat have 12 channels). A workaround is to force the concat layer to use the CAFFE engine.
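To make the condition above concrete, here is a minimal Python sketch that checks the two concat inputs from the reported net against the 16-channel blocking of nChw16c. The helper names (`padded_channels`, `unaligned_inputs`) and the `concat_inputs` dictionary are made up for illustration and are not part of Caffe or MKLDNN; the channel counts are copied from the `num_output` values in the prototxt. It only flags inputs that do not fill whole blocks and does not reproduce the crash itself.

```python
BLOCK = 16  # nChw16c groups channels in blocks of 16


def padded_channels(channels, block=BLOCK):
    """Round a channel count up to a whole number of blocks."""
    return -(-channels // block) * block


def unaligned_inputs(inputs, block=BLOCK):
    """Return the names of inputs whose channel count is not a multiple of block."""
    return [name for name, channels in inputs.items() if channels % block != 0]


# Channel counts of the two Concat bottoms (num_output in the reported prototxt).
concat_inputs = {"fire2/expand1x1": 12, "fire2/expand3x3": 12}

if __name__ == "__main__":
    for name, channels in concat_inputs.items():
        print(name, channels, "-> padded to", padded_channels(channels))
    print("inputs not divisible by 16:", unaligned_inputs(concat_inputs))
```

Both inputs are flagged, which matches the case described above. If the two expand layers were changed to output 16 channels each, the list would be empty, though whether that avoids the crash in practice would need to be confirmed on the affected build.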

ftian1 avatar Aug 22 '19 01:08 ftian1