caffe
Memory error with MKLDNN concat layer
Hi, I'm using a really simple network, and this error happens:
I modified the question so that a small net can reproduce the error reliably.
==========================================================================
Update: this small network causes the crash:
input: "data"
input_dim: 1
input_dim: 3
input_dim: 32
input_dim: 32
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 16
    kernel_size: 5
    stride: 1
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fire2/expand1x1"
  type: "Convolution"
  bottom: "conv1"
  top: "fire2/expand1x1"
  convolution_param {
    num_output: 12
    kernel_size: 1
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fire2/expand3x3"
  type: "Convolution"
  bottom: "conv1"
  top: "fire2/expand3x3"
  convolution_param {
    num_output: 12
    pad: 1
    kernel_size: 3
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fire2/concat"
  type: "Concat"
  bottom: "fire2/expand1x1"
  bottom: "fire2/expand3x3"
  top: "fire2/concat"
}
and it causes:
(gdb) bt
#0 0x00007fffeb1428a5 in raise () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#1 0x00007fffeb144085 in abort () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#2 0x00007fffeb17fa37 in __libc_message () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#3 0x00007fffeb185366 in malloc_printerr () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#4 0x00007fffeb187e93 in _int_free () from /home/usertest/my/my_algo/bin/../lib_cpu/libc.so.6
#5 0x00007fffee5d099d in mkldnn::concat::concat(mkldnn::concat::primitive_desc const&, std::vector<mkldnn::primitive::at, std::allocator<mkldnn::primitive::at> >&, mkldnn::memory const&) ()
from /home/usertest/Caffe_lib/intel_caffe/.build_release/examples/cpp_classification/../../lib/libcaffe.so.1.1.2
#6 0x00007fffee5d623d in caffe::MKLDNNConcatLayer
This looks more like an out-of-memory allocation issue. How much memory do you have on these two machines?
Thanks for the reply. I have enough memory on both machines, about 160 GB each. I don't think this small network can use up all of it.
I reproduced your issue on my side and am digging into it. It looks like a memory corruption issue caused by MKLDNN. I will get back to you if I have findings.
Sorry for the late response.
I ran your test code on SKX. The error I hit is a little different from yours: it crashes in posix_malloc() with an unreasonable size. After digging into it, the cause is that MKLDNN doesn't support concatenating two nChw16c tensors whose channel counts are not divisible by 16 (here, both concat inputs have 12 channels). A workaround is to force the concat layer to use the CAFFE engine.
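For illustration, the workaround might look like the sketch below, which reuses the fire2/concat layer from the network in the question. The layer-level engine field and its "CAFFE" value are an assumption about the prototxt schema of the Intel Caffe build in use. The thread does not show the exact syntax, so check the caffe.proto that ships with your version before relying on it.

```prototxt
layer {
  name: "fire2/concat"
  type: "Concat"
  bottom: "fire2/expand1x1"   # 12 channels, not divisible by 16
  bottom: "fire2/expand3x3"   # 12 channels, not divisible by 16
  top: "fire2/concat"
  # Assumed field: selects the plain Caffe implementation for this layer
  # only, so the MKLDNN concat primitive is not used here.
  engine: "CAFFE"
}
```

Only the concat layer is switched in this sketch; the convolution layers keep whatever engine the build selects by default.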