DA-RNN
Error on running test code
When I run test_net.py, I encounter CUDA memory-related errors (e.g. segmentation fault, "CUDA error: an illegal memory access was encountered", etc.). The error messages change from run to run. Has anyone had similar problems?
What kind of GPU do you use? I think DA-RNN needs at least 6 GB of GPU memory. However, the errors may also come from third-party libraries that are not installed correctly; see #2 / #10. Also, which versions of CUDA, cuDNN, TensorFlow and Ubuntu are you using?
I ran the training code with no problem, so the dependencies are probably fine. I have a TITAN X and a GeForce GTX GPU. CUDA: 8.0.61, cuDNN: 5.1, Ubuntu: 16.04, TensorFlow: 1.2.1.
Do you give the device ID as an input parameter to your script?
Use nvidia-smi to check the ID of your TITAN GPU and pass it to the script. I do not know which GeForce GTX model you have, but a TITAN should run just fine. (However, the test script does not work for me yet because of #9.)
By the way, the training scripts have never been an issue; the test scripts seem to be the troublemakers.
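For readers who want to automate the "check the ID with nvidia-smi" step, here is a minimal sketch. It assumes the GPU list was produced with `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits`; the helper names `parse_gpu_list` and `pick_gpu` and the sample output are hypothetical and not part of DA-RNN. The 6 GB threshold is only the rough estimate given above.

```python
def parse_gpu_list(csv_text):
    """Parse lines of 'index, name, memory_mib' into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split off the first and last fields so a comma in the name is harmless.
        index, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        gpus.append({
            "index": int(index),
            "name": name.strip(),
            "memory_mib": int(memory),
        })
    return gpus


def pick_gpu(gpus, min_memory_mib=6 * 1024):
    """Return the index of the largest GPU, or None if none has enough memory."""
    candidates = [g for g in gpus if g["memory_mib"] >= min_memory_mib]
    if not candidates:
        return None
    return max(candidates, key=lambda g: g["memory_mib"])["index"]


# Hypothetical sample output, not taken from this thread.
sample = """0, GeForce GTX 960, 2002
1, TITAN X (Pascal), 12189"""
gpus = parse_gpu_list(sample)
print(pick_gpu(gpus))
```

Note that the index reported by nvidia-smi does not necessarily match the CUDA device order; setting CUDA_DEVICE_ORDER=PCI_BUS_ID before CUDA_VISIBLE_DEVICES is one common way to make the two agree.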
Yes, the device ID is 0. This is the command I ran: ./experiments/scripts/rgbd_scene_multi_rgbd_test.sh
And here are the contents of rgbd_scene_multi_rgbd_test.sh:
#!/bin/bash

set -x
set -e

export PYTHONUNBUFFERED="True"
export CUDA_VISIBLE_DEVICES=$1
#export LD_PRELOAD=/usr/lib/libtcmalloc.so.4

LOG="experiments/logs/rgbd_scene_multi_rgbd.txt.`date +'%Y-%m-%d_%H-%M-%S'`"
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"

# train FCN for multiple frames
time ./tools/train_net.py --gpu 0 \
  --network vgg16 \
  --weights data/imagenet_models/vgg16_convs.npy \
  --imdb rgbd_scene_train \
  --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
  --iters 40000

if [ -f $PWD/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl ]
then
  rm $PWD/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl
fi

# test FCN for multiple frames
time ./tools/test_net.py --gpu 0 \
  --network vgg16 \
  --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt \
  --imdb rgbd_scene_val \
  --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
  --rig data/RGBDScene/camera.json \
  --kfusion 1
Have you tried running ./experiments/scripts/rgbd_scene_multi_rgbd_test.sh 0 instead?
Maybe try running it with sudo.
The testing code calls the C++ KinectFusion library from Python. This step is not stable. I also encountered crashes, due to some malloc issue inside Python. You can debug by running "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and then using backtrace to see where the problem is.
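The gdb suggestion above can also be run non-interactively. The following is a small sketch, not part of DA-RNN: the helper name `gdb_backtrace_command` is hypothetical, and it assumes gdb's standard `-batch`, `-ex` and `--args` options, so that gdb runs the program and prints a backtrace as soon as it stops on a crash.

```python
import shlex


def gdb_backtrace_command(script, script_args, python="python"):
    """Build an argv list that runs a Python script under gdb in batch mode."""
    # "-ex run" starts the program; "-ex bt" prints the backtrace once it stops.
    gdb_part = ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args"]
    return gdb_part + [python, script] + list(script_args)


# Arguments copied from the command quoted in this thread.
test_args = [
    "--gpu", "0",
    "--network", "vgg16",
    "--model", "output/rgbd_scene/rgbd_scene_train/"
               "vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt",
    "--imdb", "rgbd_scene_val",
    "--cfg", "experiments/cfgs/rgbd_scene_multi_rgbd.yml",
    "--rig", "data/RGBDScene/camera.json",
    "--kfusion", "1",
]
command = gdb_backtrace_command("./tools/test_net.py", test_args)

# Print a copy-pasteable shell command instead of launching gdb directly.
print(" ".join(shlex.quote(part) for part in command))
```

Passing `command` to `subprocess.call` would launch it directly; printing it first makes it easy to paste the exact invocation into a bug report.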
I ran this and there was no problem, but when I added --kfusion 1 at the end, I encountered this error:
[New Thread 0x7ffe65ffb700 (LWP 8553)]
[New Thread 0x7ffe667fc700 (LWP 8554)]
[New Thread 0x7ffe67fff700 (LWP 8555)]
[New Thread 0x7ffe677fe700 (LWP 8556)]
[New Thread 0x7ffe66ffd700 (LWP 8557)]
[New Thread 0x7ffe5e22a700 (LWP 8558)]
[New Thread 0x7ffe5da29700 (LWP 8559)]
[New Thread 0x7ffe5d228700 (LWP 8560)]
[New Thread 0x7ffe5ca27700 (LWP 8561)]
[New Thread 0x7ffe4ffff700 (LWP 8562)]
[New Thread 0x7ffe4f7fe700 (LWP 8563)]
[New Thread 0x7ffe4effd700 (LWP 8564)]
[New Thread 0x7ffe4e7fc700 (LWP 8565)]
[New Thread 0x7ffe4dffb700 (LWP 8566)]
[New Thread 0x7ffe4d7fa700 (LWP 8567)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:245
245 ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
(gdb) quit
A debugging session is active.
Inferior 1 [process 7937] will be killed.
@kevinkit the same happens when I add 0 at the end of the command.
When I ran it with sudo, I got this error:
+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
+ exec
++ tee -a experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
+ echo Logging output to experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
Logging output to experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
+ '[' -f /home/aliman/DA-RNN-master/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl ']'
+ ./tools/test_net.py --gpu 0 --network vgg16 --model data/fcn_models/rgbd_scene/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json --kfusion 1
Traceback (most recent call last):
  File "./tools/test_net.py", line 13, in <module>
    from fcn.test import test_net
  File "/home/aliman/DA-RNN-master/tools/../lib/fcn/test.py", line 25, in <module>
    from kinect_fusion import kfusion
ImportError: libkfusion.so: cannot open shared object file: No such file or directory
(But libkfusion.so does exist in the DA-RNN/lib/kinect_fusion/build directory.)
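A file existing on disk is not enough for the import to succeed: the dynamic loader only searches its configured locations, such as the directories listed in LD_LIBRARY_PATH. With a default configuration, sudo usually resets the environment, so a variable exported in the user's shell may not reach the script (this is an assumption about the setup here, not something the thread confirms). A minimal sketch for checking what the loader actually sees, using a hypothetical helper `try_load` built on the standard `ctypes` module:

```python
import ctypes


def try_load(library_name):
    """Try to load a shared library by name; return (ok, error_message)."""
    try:
        ctypes.CDLL(library_name)
    except OSError as err:
        # The loader's own message, e.g. "cannot open shared object file".
        return False, str(err)
    return True, None


# Run this in the same environment (same user, same shell) as test_net.py.
ok, message = try_load("libkfusion.so")
if ok:
    print("libkfusion.so was found by the dynamic loader")
else:
    print("libkfusion.so could not be loaded: %s" % message)
```

The search path is read when the Python process starts, so changing `os.environ["LD_LIBRARY_PATH"]` from inside the running script is generally too late; the variable has to be set before Python is launched.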
Have you ever solved the problem? I have run into the same situation and I don't know how to work it out.
As @yuxng mentioned before, you can try to trace the problem with the gdb debugger. Run "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and use backtrace to see the problem.
@AliManUtd1993, did you compile DA-RNN successfully? I always encounter an error in Kinect_Fusion.
I compiled all parts except the kinect_fusion part.
@AliManUtd1993, have you compiled DA-RNN successfully now? When I run test_kinect_fusion.sh, it always shows
ImportError: libkfusion.so: cannot open shared object file: No such file or directory
But libkfusion.so is in lib/kinect_fusion/build, and the other parts run successfully.
No, I did not try anymore.
Thank you for your quick reply.
@yuxng @kevinkit I hit the same problem, and I found that the error happens at kinect_fusion.cpp => create_tensors() => initMarchingCubesTables();
I ran the suggested command, "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and took a backtrace.
It shows:
#6 0x00007ffff7814f45 in __libc_start_main (main=0x466e50
Hi, @beginnerFighting
Have you solved the problem "ImportError: libkfusion.so: cannot open shared object file: No such file or directory" ?
Thanks for your reply!
Hi, @Ramay7, I got the same error as you. Have you solved the issue, or do you have any suggestions? Thanks for your help!
Hi, @Wei2624. I have given up on this project and didn't find any solution, sorry.... :(
I think you forgot this step:
Add the KinectFusion library path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ROOT/lib/kinect_fusion/build
I have to execute this step every time I start the computer; otherwise you'll hit that error.
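An export like this only lasts for the current shell session, which is why it has to be repeated after a restart unless it is added to a startup file such as ~/.bashrc. As a quick sanity check before launching the test script, the following sketch reports whether the build directory is on the path. The helper name `on_ld_library_path` is hypothetical, and `ROOT` is assumed to point at the DA-RNN checkout, as in the export line above.

```python
import os


def on_ld_library_path(lib_dir, ld_library_path):
    """Return True if lib_dir is one of the ':'-separated path entries."""
    wanted = os.path.normpath(lib_dir)
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


# ROOT is assumed to be the DA-RNN checkout; the fallback is only a placeholder.
root = os.environ.get("ROOT", "/path/to/DA-RNN")
build_dir = os.path.join(root, "lib", "kinect_fusion", "build")
current = os.environ.get("LD_LIBRARY_PATH", "")

if on_ld_library_path(build_dir, current):
    print("OK: %s is on LD_LIBRARY_PATH" % build_dir)
else:
    print("Missing: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:%s" % build_dir)
```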
@yuxng I would like to know how you addressed the malloc issue you mentioned. It seems that I have hit the same error as you. I tested the trained model with the command: sudo gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model data/fcn_models/rgbd_scene/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json --kfusion 1
and got this error in gdb:
(gdb) bt
#0 malloc_consolidate (av=av@entry=0x7ffff7bb4b20 <main_arena>) at malloc.c:4181
#1 0x00007ffff7871cde in _int_malloc (av=av@entry=0x7ffff7bb4b20 <main_arena>, bytes=bytes@entry=1024) at malloc.c:3450
#2 0x00007ffff7874184 in __GI___libc_malloc (bytes=1024) at malloc.c:2913
#3 0x00007fff973b7685 in __pyx_insert_code_object (code_object=0x7fff7e7c28b0, code_line=1390) at kinect_fusion/kfusion.cpp:6647
#4 __Pyx_AddTraceback (funcname=funcname@entry=0x7fff973c34c0 "kinect_fusion.kfusion.PyKinectFusion.__cinit__", c_line=c_line@entry=1390, py_line=py_line@entry=32,
filename=filename@entry=0x7fff973c2362 "kinect_fusion/kfusion.pyx") at kinect_fusion/kfusion.cpp:6750
#5 0x00007fff973b9931 in __pyx_pf_13kinect_fusion_7kfusion_14PyKinectFusion___cinit__ (__pyx_v_self=0x7fff9d997c48, __pyx_v_rig_file="")
at kinect_fusion/kfusion.cpp:1406
#6 __pyx_pw_13kinect_fusion_7kfusion_14PyKinectFusion_1__cinit__ (__pyx_kwds=