pointnet2
Segmentation Fault while training Semantic Scene Parsing (scannet)
I have Ubuntu 19.10, Python 2.7.17, TensorFlow 1.14.0, and CUDA 10.2. When I run train.py, I get a segmentation fault:
python train.py
pid: 5492
WARNING:tensorflow:From /home/mradwan/other-projects/pointnet2/scannet/pointnet2_sem_seg.py:12: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
Tensor("Placeholder_3:0", shape=(), dtype=bool, device=/device:GPU:0)
WARNING:tensorflow:From train.py:91: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.
WARNING:tensorflow:From train.py:111: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
--- Get model and loss
WARNING:tensorflow:From /home/mradwan/other-projects/pointnet2/utils/pointnet_util.py:107: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
Segmentation fault (core dumped)
Extra details:
The contents of the sh files I used to compile the TF operators are:
tf_interpolate_compile.sh:
g++ -std=c++11 tf_interpolate.cpp -o tf_interpolate_so.so -shared -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include -I /usr/local/cuda-10.2/include -I /usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-10.2/lib64/ -L/usr/local/lib/python2.7/dist-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0
tf_sampling_compile.sh:
/usr/local/cuda-10.2/bin/nvcc tf_sampling_g.cu -o tf_sampling_g.cu.o -c -O2 -DGOOGLE_CUDA=1 -x cu -Xcompiler -fPIC
g++ -std=c++11 tf_sampling.cpp tf_sampling_g.cu.o -o tf_sampling_so.so -shared -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include -I /usr/local/cuda-10.2/include -I /usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-10.2/lib64/ -L/usr/local/lib/python2.7/dist-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0
tf_grouping_compile.sh:
/usr/local/cuda-10.2/bin/nvcc tf_grouping_g.cu -o tf_grouping_g.cu.o -c -O2 -DGOOGLE_CUDA=1 -x cu -Xcompiler -fPIC
g++ -std=c++11 tf_grouping.cpp tf_grouping_g.cu.o -o tf_grouping_so.so -shared -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include -I /usr/local/cuda-10.2/include -I /usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-10.2/lib64/ -L/usr/local/lib/python2.7/dist-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0
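The include paths, library paths, and ABI define in the scripts above are hard-coded. As a rough sketch, they could instead be taken from the installed TensorFlow via `tf.sysconfig.get_compile_flags()` and `tf.sysconfig.get_link_flags()`, so that they match the binary actually in use. The helper names `tf_build_flags` and `build_command` are hypothetical, the CUDA root is assumed to be the one from the scripts above, and nothing here is confirmed to fix the crash.

```python
from __future__ import print_function


def tf_build_flags():
    """Ask the installed TensorFlow for its own compile and link flags."""
    # Imported lazily so build_command() can be used without TensorFlow.
    import tensorflow as tf
    return tf.sysconfig.get_compile_flags(), tf.sysconfig.get_link_flags()


def build_command(sources, output, compile_flags, link_flags,
                  cuda_root="/usr/local/cuda-10.2"):
    """Assemble a g++ command line shaped like the *_compile.sh scripts.

    cuda_root is an assumption taken from the scripts in this post.
    """
    cmd = ["g++", "-std=c++11"] + list(sources)
    cmd += ["-o", output, "-shared", "-fPIC"]
    # Include path and _GLIBCXX_USE_CXX11_ABI value as reported by TensorFlow.
    cmd += list(compile_flags)
    cmd += ["-I", cuda_root + "/include"]
    cmd += ["-L", cuda_root + "/lib64", "-lcudart"]
    # Library directory and framework library name as reported by TensorFlow.
    cmd += list(link_flags)
    cmd += ["-O2"]
    return cmd
```

With TensorFlow importable, something like `print(" ".join(build_command(["tf_sampling.cpp", "tf_sampling_g.cu.o"], "tf_sampling_so.so", *tf_build_flags())))` would print a command to compare against tf_sampling_compile.sh. If the reported link flags already name the versioned framework library, copying it to an unversioned file name may not be needed.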
I also needed to make some changes to get the compilation to work:
1- copied /usr/local/lib/python2.7/dist-packages/tensorflow/libtensorflow_framework.so.1 to /usr/local/lib/python2.7/dist-packages/tensorflow/libtensorflow_framework.so
2- ran
pip install -U scikit-learn scipy matplotlib
3- copied the files plyfile.py, plyfile.pyc, eulerangles.py, eulerangles.pyc from pointnet/utils
parser.add_argument('--model', default='model', help='Model name [default: model]')
5- added sys.path.append(os.path.join(ROOT_DIR, 'models')) in train.py, after sys.path.append(os.path.join(ROOT_DIR, 'utils'))
These steps got the TF operators compiled and train.py running, but then I get the error above. Does anyone have an idea why this happens?
@charlesq34 @ericyi @suhaochina @rqi-nuro
Have you found a solution? I have the same issue.
Me too!
@charlesq34 @ericyi @suhaochina @rqi-nuro
It's possible that the PointNet++ architecture needs more memory than your machine has. You can try running the same setup on Google Colab and see if it works.
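Before moving to another machine, it may help to narrow down where the crash comes from. A minimal sketch, assuming a Linux host: load each compiled operator library with `tf.load_op_library` in a separate child process and check whether that process is killed by SIGSEGV. The function names below are hypothetical, and the library paths you pass in depend on where your .so files were built.

```python
from __future__ import print_function
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child process and return its exit status."""
    return subprocess.call([sys.executable, "-c", code])


def describe(status):
    """Translate a child exit status into a short verdict."""
    if status == 0:
        return "ok"
    # On POSIX, a negative status means the child was killed by that signal.
    if status == -signal.SIGSEGV:
        return "segmentation fault"
    if status < 0:
        return "killed by signal %d" % -status
    return "python error (exit code %d)" % status


def check_op_library(path):
    """Try tf.load_op_library(path) in a child so a crash cannot kill the caller."""
    code = "import tensorflow as tf; tf.load_op_library(%r)" % path
    return describe(run_isolated(code))
```

Calling `check_op_library("tf_sampling_so.so")` and likewise for the grouping and interpolation libraries (adjusting the paths) only shows whether loading alone crashes. If all three load cleanly, the fault more likely occurs later, during graph construction or execution.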