make runtest segfaulting

Open nitbix opened this issue 9 years ago • 16 comments

I'm not sure what might be causing this, but here's what I'm seeing when I run make runtest on a checkout of master. I'm running on Debian Jessie, with GCC 4.9 and CUDA 8 RC. The only interesting thing about this machine is that it has 4x GTX 1080.

[----------] 2 tests from HingeLossLayerTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] HingeLossLayerTest/2.TestGradientL2
[       OK ] HingeLossLayerTest/2.TestGradientL2 (6 ms)
[ RUN      ] HingeLossLayerTest/2.TestGradientL1
[       OK ] HingeLossLayerTest/2.TestGradientL1 (6 ms)
[----------] 2 tests from HingeLossLayerTest/2 (12 ms total)

[----------] 9 tests from AdaGradSolverTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] AdaGradSolverTest/2.TestLeastSquaresUpdateWithEverythingAccumShare
[       OK ] AdaGradSolverTest/2.TestLeastSquaresUpdateWithEverythingAccumShare (12 ms)
[ RUN      ] AdaGradSolverTest/2.TestAdaGradLeastSquaresUpdateWithEverythingShare
*** Aborted at 1474827886 (unix time) try "date -d @1474827886" if you are using GNU date ***
PC: @     0x7fbbf4951e2d (unknown)
*** SIGSEGV (@0x1451f000) received by PID 23925 (TID 0x7fbc03491a00) from PID 340914176; stack trace: ***
    @     0x7fbbf4bd38d0 (unknown)
    @     0x7fbbf4951e2d (unknown)
    @     0x7fbbf5496350 std::vector<>::_M_erase()
    @     0x7fbbf549427d caffe::DevicePair::compute()
    @     0x7fbbf5499123 caffe::P2PSync<>::Prepare()
    @     0x7fbbf54997a0 caffe::P2PSync<>::Run()
    @           0x6af00e caffe::GradientBasedSolverTest<>::RunLeastSquaresSolver()
    @           0x6c2d2f caffe::GradientBasedSolverTest<>::TestLeastSquaresUpdate()
    @           0x6c31b0 caffe::AdaGradSolverTest_TestAdaGradLeastSquaresUpdateWithEverythingShare_Test<>::TestBody()
    @           0x8ff553 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0x8f7eca testing::Test::Run()
    @           0x8f8018 testing::TestInfo::Run()
    @           0x8f80f5 testing::TestCase::Run()
    @           0x8f8a28 testing::internal::UnitTestImpl::RunAllTests()
    @           0x8f8d03 testing::UnitTest::Run()
    @           0x46e9df main
    @     0x7fbbf483ab45 (unknown)
    @           0x4764e9 (unknown)
    @                0x0 (unknown)
Makefile:526: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault

EDIT: I'm compiling with cuDNN enabled, but turning it off doesn't seem to make a difference.

nitbix avatar Sep 25 '16 18:09 nitbix

I ran into exactly the same problem today with Ubuntu 16.04, 4x K80, CUDA 8 RC, and GCC 5.3. Advice highly appreciated!

ruonanl avatar Sep 27 '16 21:09 ruonanl

This may be unrelated, but as an extra data point, I also get a segfault if I import pycaffe and theano in the same file and then try to do anything with caffe. Let me know if I can provide any extra info!

nitbix avatar Sep 27 '16 23:09 nitbix

An extra observation: segfaults appear in several "SolverTest" cases, but all share the same stack trace:
    std::vector<>::_M_erase()
    caffe::DevicePair::compute()
    caffe::P2PSync<>::Prepare()
    caffe::P2PSync<>::Run()
    caffe::GradientBasedSolverTest<>::TestLeastSquaresUpdate()

ruonanl avatar Sep 28 '16 13:09 ruonanl

Same here. Titan X (Pascal)*6 + K80*2 + GTX1080*1 + Ubuntu 16.04 + cuDNN v5.1 + CUDA 8 + GCC 5.4.

[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN      ] SGDSolverTest/2.TestLeastSquaresUpdateWithWeightDecay
*** Aborted at 1475986823 (unix time) try "date -d @1475986823" if you are using GNU date ***
PC: @     0x7f13e92fd512 (unknown)
*** SIGSEGV (@0x19ae2000) received by PID 14082 (TID 0x7f13f0ac7ac0) from PID 430841856; stack trace: ***
    @     0x7f13e958a3d0 (unknown)
    @     0x7f13e92fd512 (unknown)
    @     0x7f13e9eae280 std::vector<>::_M_erase()
    @     0x7f13e9eac494 caffe::DevicePair::compute()
    @     0x7f13e9eb1d50 caffe::P2PSync<>::Prepare()
    @     0x7f13e9eb285e caffe::P2PSync<>::Run()
    @           0x5b409e caffe::GradientBasedSolverTest<>::TestLeastSquaresUpdate()
    @           0x5b49ff caffe::SGDSolverTest_TestLeastSquaresUpdateWithWeightDecay_Test<>::TestBody()
    @           0x91ad53 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0x91436a testing::Test::Run()
    @           0x9144b8 testing::TestInfo::Run()
    @           0x914595 testing::TestCase::Run()
    @           0x91586f testing::internal::UnitTestImpl::RunAllTests()
    @           0x915b93 testing::UnitTest::Run()
    @           0x46d9ed main
    @     0x7f13e91d0830 __libc_start_main
    @           0x475459 _start
    @                0x0 (unknown)
Makefile:526: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault (core dumped)

denru01 avatar Oct 09 '16 04:10 denru01

I suspect it is a bug in multi-GPU support. I used "export CUDA_VISIBLE_DEVICES=0" to make only one GPU visible to Caffe, and then all the tests passed.

[==========] 2081 tests from 277 test cases ran. (353009 ms total)
[  PASSED  ] 2081 tests.
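
The same restriction can be scoped to a single command rather than the whole shell session. A minimal sketch, assuming a POSIX-like shell; the helper name run_on_gpu is hypothetical and not part of Caffe:

```shell
# Hypothetical helper (not part of Caffe): run one command with a single
# GPU visible to CUDA. CUDA_VISIBLE_DEVICES in the calling shell is not changed.
run_on_gpu() {
  gpu_id="$1"   # index of the GPU to expose, e.g. 0
  shift         # the remaining arguments are the command to run
  env CUDA_VISIBLE_DEVICES="$gpu_id" "$@"
}

# Usage, from the Caffe source directory:
#   run_on_gpu 0 make runtest
```

Because the variable is passed through env, later commands in the same terminal still see every GPU.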

denru01 avatar Oct 09 '16 05:10 denru01

@nitbix

This may be unrelated, but as an extra data point, I also get a segfault if I import pycaffe and theano in the same file and then try to do anything with caffe. Let me know if I can provide any extra info!

This was fixed in theano in commit bb170f4fb201109f88b95da282ed3a21b5021c13 (23 Sep 2016). It was calling cudaThreadExit on shutdown, which then caused a segfault when Caffe subsequently called cublasDestroy on cleanup.

kevcampb avatar Dec 16 '16 03:12 kevcampb

Dear All,

Please advise how you solved this issue, as I have the same problem. Any answer is highly appreciated. [image: problem 1]

RuaYahya avatar Mar 08 '17 01:03 RuaYahya

Hi all, I have the same problem on Ubuntu 16.04. [image: screenshot from 2017-03-23 14-06-46] Any answer is highly appreciated. Thank you.

@RuaYahya Did you solve the issue?

karimpazoki avatar Mar 23 '17 09:03 karimpazoki

Hi, the problem in my case was that my laptop does not have an Nvidia card. Check whether your graphics processing unit is Nvidia or not. It works fine when I try another laptop. Thanks

RuaYahya avatar Mar 23 '17 10:03 RuaYahya

@RuaYahya
*** SIGABRT (@0x113c) received by PID 4412 (TID 0x7f64016a5b00) from PID 4412; stack trace: ***
    @     0x7f63ffd094b0 (unknown)
    @     0x7f63ffd09428 gsignal
    @     0x7f63ffd0b02a abort
    @     0x7f63ffd4b7ea (unknown)
    @     0x7f63ffd53e0a (unknown)
    @     0x7f63ffd5798c cfree
    @     0x7f64008878af google::protobuf::internal::DestroyDefaultRepeatedFields()
    @     0x7f6400886b3b google::protobuf::ShutdownProtobufLibrary()
    @     0x7f63e98c6329 (unknown)
    @     0x7f64015a2c17 (unknown)
    @     0x7f63ffd0dff8 (unknown)
    @     0x7f63ffd0e045 exit
    @     0x7f63ffcf4837 __libc_start_main
    @           0x4077c9 _start
    @                0x0 (unknown)
Makefile:532: recipe for target 'runtest' failed

I have the same problem on Ubuntu 16.04. Did you solve the issue?

FangbRen avatar Jun 13 '17 08:06 FangbRen

@Mehuli-Ruh11 I believe he would simply include it before the command, like this:
    export CUDA_VISIBLE_DEVICES=0 make runtest
This fixed the error for me. It's related to this line in Makefile.config:
    # The ID of the GPU that 'make runtest' will use to run unit tests.
    TEST_GPUID := 0
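
Typed as one line, `export CUDA_VISIBLE_DEVICES=0 make runtest` only marks names for export and never invokes make, so the setting and the test run are presumably meant either as two separate commands or as a one-command prefix. The sketch below contrasts the two forms. It is illustrative only: `sh -c` stands in for `make runtest` so it runs without a Caffe checkout, and the variable names are made up for the example.

```shell
# Start from a clean state so the comparison is unambiguous.
unset CUDA_VISIBLE_DEVICES

# Prefix form: the setting applies to this one child process only.
# With Caffe this would be: CUDA_VISIBLE_DEVICES=0 make runtest
prefix_seen=$(CUDA_VISIBLE_DEVICES=0 sh -c 'echo "${CUDA_VISIBLE_DEVICES:-unset}"')
after_prefix=${CUDA_VISIBLE_DEVICES:-unset}

# Export form: the setting persists for every later command in this shell.
# With Caffe this would be two commands: the export, then make runtest.
export CUDA_VISIBLE_DEVICES=0
export_seen=$(sh -c 'echo "${CUDA_VISIBLE_DEVICES:-unset}"')
after_export=${CUDA_VISIBLE_DEVICES:-unset}

echo "prefix: child=$prefix_seen shell=$after_prefix"
echo "export: child=$export_seen shell=$after_export"
```

Either form hides the other GPUs from the test binary; the prefix form is the safer choice when the same terminal is later used for multi-GPU work.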

MarcoForte avatar Jul 20 '17 21:07 MarcoForte

@denru01 I had a similar problem. I had a boost python package installed through conda, and its version differed from the one on my system. If you are using Anaconda, just uninstall the boost python package (conda uninstall boost). That might fix the problem.

ChloeEHKim avatar Oct 23 '17 12:10 ChloeEHKim

Did anyone find a solution? I have the same problem, and I'm running on Ubuntu 16.04 with only one GPU (GTX 1080) and CUDA 8.

KevinTchaka avatar Nov 13 '17 12:11 KevinTchaka

@FangbRen Hello, I ran into the same problem as you when installing Caffe, and I would like to ask how you solved it. Thank you.

YuGongCharley avatar Jan 29 '19 16:01 YuGongCharley

I solved this issue with the command: make runtest -j export CUDA_VISIBLE_DEVICES=0

lswgh avatar Apr 07 '21 02:04 lswgh

I have the same problem, but my output only looks like this. I use Ubuntu 16.04, CUDA 9.0 and cuDNN 7.0 with a 2080 Ti. Any answer is highly appreciated.

[image]

learninginvision avatar Jul 01 '21 03:07 learninginvision