
SegNet training instability: various errors cause training to abort

Open · beejisbrigit opened this issue · 1 comment

I am having problems with training instability: SegNet Caffe periodically fails with various errors, while at other times training proceeds normally.

The first error I get is related to this issue in the SegNet-Tutorial GitHub repo. If I remove the Accuracy layer at the end of the model file, the problem seems to go away (see the backtrace below, which points at the Accuracy layer; a sample Accuracy layer block follows the backtrace):

```
I1026 13:19:37.835263  2388 solver.cpp:266] Learning Rate Policy: step
*** Error in `./caffe-segnet-multi-gpu/build/tools/caffe': malloc(): memory corruption (fast): 0x0000000008213fa0 ***
*** Aborted at 1477455578 (unix time) try "date -d @1477455" if you are using GNU date ***
PC: @ 0x7fdc1b426c37 (unknown)
*** SIGABRT (@0x3e800000954) received by PID 2388 (TID 0x7fdc1d584780) from PID 2388; stack trace: ***
    @     0x7fdc1b426cb0  (unknown)
    @     0x7fdc1b426c37  (unknown)
    @     0x7fdc1b42a028  (unknown)
    @     0x7fdc1b4632a4  (unknown)
    @     0x7fdc1b46dff7  (unknown)
    @     0x7fdc1b470cf4  (unknown)
    @     0x7fdc1b4726c0  (unknown)
    @     0x7fdc1c059dad  (unknown)
    @     0x7fdc1ce006fd  std::vector<>::_M_insert_aux()
    @     0x7fdc1ce028ac  caffe::AccuracyLayer<>::Forward_cpu()
    @     0x7fdc1cd46a51  caffe::Net<>::ForwardFromTo()
    @     0x7fdc1cd46dc7  caffe::Net<>::ForwardPrefilled()
    @     0x7fdc1cd6bf19  caffe::Solver<>::Step()
    @     0x7fdc1cd6c743  caffe::Solver<>::Solve()
    @           0x408ebb  train()
    @           0x4069b1  main
    @     0x7fdc1b411f45  (unknown)
    @           0x40710c  (unknown)
    @               0x0   (unknown)
```
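For reference, the Accuracy layer I removed looks roughly like the block below. This is a sketch in the usual SegNet-Tutorial style, not a verbatim copy of my model file; the blob names are illustrative:

```
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "conv1_1_D"          # final decoder convolution (illustrative name)
  bottom: "label"
  top: "accuracy"
  top: "per_class_accuracy"    # extra top supported by the caffe-segnet fork's Accuracy layer
}
```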

The second error is related to BLAS; I am using ATLAS, the default BLAS when building Caffe. This problem persists both with and without the Accuracy layer removed:

F1104 15:41:07.086585 12190 math_functions.cu:123] Check failed: status == CUBLAS_STATUS_SUCCESS (11 vs. 0) CUBLAS_STATUS_MAPPING_ERROR *** Check failure stack trace: *** @ 0x7f7f6c930daa (unknown) @ 0x7f7f6c930ce4 (unknown) @ 0x7f7f6c9306e6 (unknown) @ 0x7f7f6c933687 (unknown) @ 0x7f7f6cd94e7b caffe::caffe_gpu_asum<>() @ 0x7f7f6cd91e5f caffe::SoftmaxWithLossLayer<>::Backward_gpu() @ 0x7f7f6cc4102c caffe::Net<>::BackwardFromTo() @ 0x7f7f6cc41271 caffe::Net<>::Backward() @ 0x7f7f6cd4ae5d caffe::Solver<>::Step() @ 0x7f7f6cd4b77f caffe::Solver<>::Solve() @ 0x4086c8 train() @ 0x406c61 main @ 0x7f7f6be42ec5 (unknown) @ 0x40720d (unknown) @ (nil) (unknown)
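The failing frame is caffe::SoftmaxWithLossLayer<>::Backward_gpu(), and this kind of CUBLAS_STATUS_MAPPING_ERROR is commonly triggered when the label images contain values outside the range the loss layer expects. For reference, a SegNet-style loss layer is typically defined roughly as below; the blob names and the ignore_label value are illustrative, not taken from my model file:

```
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "conv1_1_D"    # class-score blob from the final decoder convolution (illustrative name)
  bottom: "label"
  top: "loss"
  softmax_param { engine: CAFFE }
  loss_param {
    ignore_label: 11     # illustrative: a "void" class excluded from the loss
  }
}
```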

beejisbrigit · Nov 04 '16 22:11

Hi, I am not an expert; I am new to Caffe and to this type of programming in general. However, I have seen CUBLAS_STATUS_MAPPING_ERROR before in my own networks. For me it happened when the wrong number of outputs was set for the Softmax layer (see the sketch below).
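As an illustration only (the layer names and class count below are made up, not taken from your model): the num_output of the final scoring convolution that feeds the Softmax/loss has to match the number of classes present in the label images, e.g.:

```
layer {
  name: "conv1_1_D"        # final scoring layer (illustrative name)
  type: "Convolution"
  bottom: "conv1_2_D"
  top: "conv1_1_D"
  convolution_param {
    num_output: 12         # must equal the number of classes, e.g. 11 CamVid classes + void
    kernel_size: 3
    pad: 1
  }
}
```

If a label value is greater than or equal to num_output, the loss layer indexes past the end of the probability blob, which can surface as exactly this kind of CUBLAS error.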

Since you say sometimes your net trains without error, I guess your problem may be different. Are you using your own data, or one of the examples? Do you have other processes active on the GPU?

nathanin · Nov 05 '16 01:11