caffe-segnet-cudnn5

Segmentation fault

Open bosmart opened this issue 8 years ago • 9 comments

I'm getting the following segmentation fault when running "make runtest". It works fine in the case of the original caffe-segnet (with cuDNN 3.0.8).

    [ RUN ] LayerFactoryTest/2.TestCreateLayer
    *** Aborted at 1483730734 (unix time) try "date -d @1483730734" if you are using GNU date ***
    PC: @ 0x7fe5c0d9cf25 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
    *** SIGSEGV (@0x208) received by PID 8650 (TID 0x7fe5c15d5ac0) from PID 520; stack trace: ***
    @ 0x7fe5c033a390 (unknown)
    @ 0x7fe5c0d9cf25 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
    @ 0x7fe5c0e55099 caffe::DataLayer<>::~DataLayer()
    @ 0xb49c08 caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
    @ 0xde7453 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @ 0xde038a testing::Test::Run()
    @ 0xde04d8 testing::TestInfo::Run()
    @ 0xde05e5 testing::TestCase::Run()
    @ 0xde217f testing::internal::UnitTestImpl::RunAllTests()
    @ 0xde24a3 testing::UnitTest::Run()
    @ 0x8905cd main
    @ 0x7fe5ba028830 __libc_start_main
    @ 0x8973a9 _start
    @ 0x0 (unknown)
    Segmentation fault (core dumped)
    src/caffe/test/CMakeFiles/runtest.dir/build.make:57: recipe for target 'src/caffe/test/CMakeFiles/runtest' failed
    make[3]: *** [src/caffe/test/CMakeFiles/runtest] Error 139
    CMakeFiles/Makefile2:328: recipe for target 'src/caffe/test/CMakeFiles/runtest.dir/all' failed
    make[2]: *** [src/caffe/test/CMakeFiles/runtest.dir/all] Error 2
    CMakeFiles/Makefile2:335: recipe for target 'src/caffe/test/CMakeFiles/runtest.dir/rule' failed
    make[1]: *** [src/caffe/test/CMakeFiles/runtest.dir/rule] Error 2
    Makefile:240: recipe for target 'runtest' failed
    make: *** [runtest] Error 2

bosmart avatar Jan 06 '17 19:01 bosmart

I have just noticed this: https://github.com/TimoSaemann/caffe-segnet-cudnn5/issues/2 (my setup: Ubuntu 16.04.1 LTS, CUDA 8.0, GeForce 980Ti).

Interestingly enough, on my second machine (Ubuntu 16.04.1 LTS, CUDA 8.0, Tesla K40) it works without any issues.

bosmart avatar Jan 06 '17 19:01 bosmart

I cannot reproduce that error. I tried it on 3 different machines and no error occurred:

  1. Ubuntu 14.04, CUDA 8.0, Titan X (Pascal), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
  2. Ubuntu 14.04, CUDA 7.5, Titan X (Maxwell), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
  3. Ubuntu 16, CUDA 8.0, GTX 980, cuDNN v.5.1, compiled with cmake

Did you compile it with cmake or make? Did you change anything in your Makefile.config other than uncommenting the cuDNN flag? Can you test and train SegNet anyway, and if not, which errors do you encounter?

TimoSaemann avatar Jan 11 '17 19:01 TimoSaemann

I compiled with cmake in both cases, i.e.:

  1. Ubuntu 16.04.1 - CUDA 8.0 - Tesla K40 (works fine)
  2. Ubuntu 16.04.1 - CUDA 8.0 - GeForce 980Ti or Titan X (produces the fault)

Interestingly enough, the fault only happens when the caffe process is terminating. So it is able to complete the given number of iterations, save the snapshot, etc., and then throws the fault when exiting.
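
For anyone who wants to confirm that the crash is confined to process exit, here is a minimal Python sketch that decodes how a run ended. It is not part of Caffe; `describe_exit` and `run_and_describe` are hypothetical helper names. It relies on the fact that make and shells report death by signal as 128 plus the signal number, which is why the log above shows `Error 139` (128 + 11, SIGSEGV).

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a process return code into a short human-readable description."""
    if returncode < 0:
        # Python's subprocess reports death-by-signal as the negated signal number.
        signum = -returncode
    elif returncode > 128:
        # Shells and make report it as 128 + signal number (139 = 128 + 11).
        signum = returncode - 128
    else:
        return "exited with status %d" % returncode
    try:
        return "killed by %s" % signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform; report the raw status.
        return "exited with status %d" % returncode


def run_and_describe(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    return describe_exit(subprocess.run(cmd).returncode)
```

A call such as `run_and_describe(["caffe", "train", "--solver=solver.prototxt"])` (the solver path is a placeholder) reporting `killed by SIGSEGV` while the final snapshot file exists on disk would be consistent with a crash during teardown and not during training.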

bosmart avatar Jan 11 '17 19:01 bosmart

I also get this segfault, with cudnn 5.05. As @bosmart mentioned, SegNet trains, saves the solver state, and then apparently Caffe's BasePrefetchingDataLayer dies when destructing the model:

    I0213 09:10:14.745064 29461 solver.cpp:322] Optimization Done.
    I0213 09:10:14.745074 29461 caffe.cpp:254] Optimization Done.
    *** Aborted at 1487005814 (unix time) try "date -d @1487005814" if you are using GNU date ***
    PC: @ 0x7f6497727d1c (unknown)
    *** SIGSEGV (@0xfffffff7) received by PID 29461 (TID 0x7f6499c259c0) from PID 18446744073709551607; stack trace: ***
    @ 0x7f64976dbcb0 (unknown)
    @ 0x7f6497727d1c (unknown)
    @ 0x7f649951c68b caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
    @ 0x7f64995eeb5b caffe::DenseImageDataLayer<>::~DenseImageDataLayer()
    @ 0x7f64995eedb2 boost::detail::sp_counted_impl_p<>::dispose()
    @ 0x40fcd1 caffe::Net<>::~Net()
    @ 0x7f64994459e2 boost::detail::sp_counted_impl_p<>::dispose()
    @ 0x7f64994ad4b1 caffe::SGDSolver<>::~SGDSolver()
    @ 0x40dd59 boost::detail::shared_count::~shared_count()
    @ 0x40b5d1 train()
    @ 0x408363 main
    @ 0x7f64976c6f45 (unknown)
    @ 0x408ce1 (unknown)
    @ 0x0 (unknown)
    Segmentation fault (core dumped)

jgorgenucsd avatar Feb 15 '17 18:02 jgorgenucsd

Very similar issue here: cuDNN 5.1, CUDA 8.0, GeForce GTX 860M, Ubuntu 16.04. Various tests fail on runtest with both cmake and make, but SegNet runs and trains fine. However, if I use an LMDB data layer, I get a segmentation fault at the end of every run, after everything is calculated and saved. If I put a `del net` command in any Python script after initializing net, I also get a segmentation fault. DenseImageData works fine, however. @bosmart @jgorgenucsd are you using DenseImageData input or some other type of input layer?

ilia-nikiforov avatar Feb 21 '17 09:02 ilia-nikiforov

Hi, I have exactly the same error. How did you solve it? Thanks.

xiaozai avatar Sep 12 '17 11:09 xiaozai

Having this same error (trains, saves solver state, fails); were you able to reproduce it, @TimoSaemann? I can send along my full workflow shortly if that helps.

drewbo avatar Apr 27 '18 01:04 drewbo

I am having the same error when I use LMDB. Does anyone know the reason for the segmentation fault?

vsuryamurthy avatar May 28 '18 15:05 vsuryamurthy

As with others here, my problem disappeared when I switched machines. My particular switch was from a laptop with a GTX860M to a desktop with a GTX1070.

ilia-nikiforov avatar May 30 '18 19:05 ilia-nikiforov