
Runtime Error: Segmentation fault

Open TriLoo opened this issue 4 years ago • 5 comments

I used my own image data with demo.py and got this error; no other information is displayed.

I have located the position causing this error: it is the ROI layer call. However, when I tested ROIAlign_cuda.cu with raw float * parameters instead of PyTorch Tensors, no error was raised.

My gcc version is 4.8.5. Is the gcc version critical? Any advice? Thanks.
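
A minimal sketch of the sanity checks I run on the tensors before the ROI layer call (the import path and the (N, 5) rois layout with a batch index in column 0 are assumptions based on the standard maskrcnn_benchmark interface, so adjust them to this repo's layout):

import torch
# Assumed import path; adjust to wherever the compiled roi_layers live in this checkout.
from external.maskrcnn_benchmark.roi_layers import ROIAlign

features = torch.randn(1, 256, 64, 64, device="cuda")                # dummy feature map
rois = torch.tensor([[0.0, 10.0, 10.0, 50.0, 50.0]], device="cuda")  # (N, 5): batch idx + box

# The CUDA extension expects contiguous float32 CUDA tensors; a mismatch here can
# crash inside the pybind11 dispatcher instead of raising a Python-level error.
assert features.is_cuda and rois.is_cuda
assert features.dtype == torch.float32 and rois.dtype == torch.float32
assert features.is_contiguous() and rois.is_contiguous()

roi_align = ROIAlign(output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)
out = roi_align(features, rois)
print(out.shape)   # expected: torch.Size([1, 256, 7, 7])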

TriLoo avatar Apr 16 '20 14:04 TriLoo

Thanks for your interest in our work! How many GPUs are you using for your job? Have you tried using 1 GPU?

xyang35 avatar Apr 17 '20 03:04 xyang35

My server contains 8 P40 GPUs. I tried using only one GPU (cuda:0) and the same error happened.

I used

os.environ['CUDA_VISIBLE_DEVICES'] = "0"   # must be CUDA_VISIBLE_DEVICES (plural) and set before CUDA is initialized

# OR

torch.cuda.set_device(0)

to use cuda:0 only.

Also, I manually set the gpu_count to 0.
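
A minimal sketch of how I check that only one device is actually visible (CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is initialized):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # must happen before the first CUDA call

import torch
torch.cuda.set_device(0)
print(torch.cuda.device_count())       # expected: 1 if the mask took effect
print(torch.cuda.current_device())     # expected: 0
print(torch.cuda.get_device_name(0))   # e.g. "Tesla P40"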

The top few frames of the call stack stored in the core file are shown below:

#0  0x00007f6df6eb13ac in construct<_object*, _object*> (__p=0xb, this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#1  _S_construct<_object*, _object*> (__p=0xb, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#2  construct<_object*, _object*> (__p=0xb, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#3  emplace_back<_object*> (this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/bits/vector.tcc:96
#4  push_back (__x=<unknown type in /search/odin/songminghui/githubs/STEP/external/maskrcnn_benchmark/roi_layers/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x0, DIE 0x12877a>, this=0x7f6e4ab5c318) at /usr/include/c++/4.8.2/bits/stl_vector.h:920
#5  loader_life_support (this=0x7ffd998f01f0) at /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/torch/lib/include/pybind11/cast.h:44
#6  pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7f6deee3be28, kwargs_in=0x0) at /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/torch/lib/include/pybind11/pybind11.h:618
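
For reference, the crash itself gives no Python traceback, so I capture diagnostics like this (a minimal sketch; the core file name depends on the system's core_pattern setting):

import faulthandler
faulthandler.enable()   # dump the Python-level traceback to stderr on SIGSEGV

# ... then run demo.py as usual; the same effect can be had with `python -X faulthandler demo.py`.
# The native frames above come from loading the core file in gdb:
#   gdb python core.<pid>
#   (gdb) bt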

TriLoo avatar Apr 17 '20 03:04 TriLoo

Hello @TriLoo, I ran into the same problem. I used one Tesla P100 GPU with 16 GB of memory. Did you solve the problem? Looking forward to your reply!

quanh1990 avatar Apr 27 '20 11:04 quanh1990

Sorry, not yet... It may be caused by the PyTorch version or the gcc version, but I am not sure.

By the way, my PyTorch version is 1.0.2.
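
For reference, the snippet I use to dump the relevant versions (gcc is queried via a subprocess, so the exact output depends on the toolchain):

import subprocess
import torch

print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("gcc:", subprocess.check_output(["gcc", "--version"]).decode().splitlines()[0])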

TriLoo avatar Apr 27 '20 11:04 TriLoo

> Thanks for your interest in our work! How many GPUs are you using for your job? Have you tried using 1 GPU?

@xyang35 Could you please provide your version of gcc? Thanks a lot.

quanh1990 avatar Apr 27 '20 23:04 quanh1990