DA-RNN

Error on running test code

Open AliBuildsAI opened this issue 7 years ago • 22 comments

screenshot from 2017-08-14 19-27-53

When I run test_net.py, I encounter CUDA memory-related errors (e.g. segmentation fault, "CUDA error: an illegal memory access was encountered", etc.). The error messages change from run to run. Has anyone had similar problems?

AliBuildsAI avatar Aug 15 '17 02:08 AliBuildsAI

What kind of GPU do you use? DA-RNN needs at least 6 GB, I think. However, it may also be related to other issues with the various third-party libraries, which need to be installed correctly; see #2 / #10. Also, which versions of CUDA, cuDNN, TensorFlow and Ubuntu are you using?
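
The version details asked for here can be collected in one pass. The following is a minimal sketch and not part of DA-RNN: the function name `report_versions` is made up for illustration, and the cuDNN header path is an assumption that may differ on your machine (newer cuDNN releases keep the version macros in a different header). Each check falls back to a short message when the tool is missing.

```shell
# Hypothetical helper (not part of DA-RNN): print GPU, CUDA, cuDNN,
# TensorFlow and distribution details, skipping whatever is not installed.
report_versions() {
    # GPU index, model and total memory as reported by the driver
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,memory.total --format=csv || true
    else
        echo "nvidia-smi not found"
    fi

    # CUDA toolkit version
    if command -v nvcc >/dev/null 2>&1; then
        nvcc --version | tail -n 2
    else
        echo "nvcc not found"
    fi

    # cuDNN version macros; the header location is an assumption
    CUDNN_HEADER="${CUDNN_HEADER:-/usr/local/cuda/include/cudnn.h}"
    if [ -f "$CUDNN_HEADER" ]; then
        grep -E '#define CUDNN_(MAJOR|MINOR|PATCHLEVEL)' "$CUDNN_HEADER" \
            || echo "no version macros found in $CUDNN_HEADER"
    else
        echo "cuDNN header not found at $CUDNN_HEADER"
    fi

    # TensorFlow version as seen by the default python interpreter
    python -c 'import tensorflow as tf; print(tf.__version__)' 2>/dev/null \
        || echo "TensorFlow not importable with 'python'"

    # Distribution description
    if command -v lsb_release >/dev/null 2>&1; then
        lsb_release -ds
    else
        echo "lsb_release not found"
    fi
    return 0
}

report_versions
```

Pasting the output into a report makes it easier to rule out a version mismatch before looking at the test code itself.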

kevinkit avatar Aug 15 '17 07:08 kevinkit

I ran the code for training with no problem, so there is probably no problem with the dependencies. I have a TITAN X and a GeForce GTX GPU.

CUDA version: 8.0.61
cuDNN: 5.1
Ubuntu: 16.04
TensorFlow version: 1.2.1

AliBuildsAI avatar Aug 15 '17 07:08 AliBuildsAI

Do you give the device ID as an input parameter to your script?

Check the ID of your TITAN GPU with nvidia-smi and pass it to the script. I do not know which kind of GeForce GTX GPU you have, but a TITAN should run just fine. (However, the test script has not worked for me yet, because of #9.)

By the way, the training scripts have never been an issue, while the test scripts seem to be the troublemakers.

kevinkit avatar Aug 15 '17 07:08 kevinkit

Yes, the device ID is 0. This is the command I ran: ./experiments/scripts/rgbd_scene_multi_rgbd_test.sh

And here are the contents of rgbd_scene_multi_rgbd_test.sh:

#!/bin/bash

set -x
set -e

export PYTHONUNBUFFERED="True"
export CUDA_VISIBLE_DEVICES=$1
#export LD_PRELOAD=/usr/lib/libtcmalloc.so.4

LOG="experiments/logs/rgbd_scene_multi_rgbd.txt.`date +'%Y-%m-%d_%H-%M-%S'`"
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"

# train FCN for multiple frames
time ./tools/train_net.py --gpu 0 \
  --network vgg16 \
  --weights data/imagenet_models/vgg16_convs.npy \
  --imdb rgbd_scene_train \
  --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
  --iters 40000

if [ -f $PWD/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl ]
then
  rm $PWD/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl
fi

# test FCN for multiple frames
time ./tools/test_net.py --gpu 0 \
  --network vgg16 \
  --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt \
  --imdb rgbd_scene_val \
  --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
  --rig data/RGBDScene/camera.json --kfusion 1

AliBuildsAI avatar Aug 15 '17 07:08 AliBuildsAI

Have you tried running ./experiments/scripts/rgbd_scene_multi_rgbd_test.sh 0 instead?
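
The reason the trailing 0 matters: the script exports CUDA_VISIBLE_DEVICES=$1, so calling it without an argument sets the variable to an empty string, which generally hides every GPU from CUDA. A minimal sketch of a guard that falls back to a default device is shown below; `select_gpu` is a hypothetical helper and the default of 0 is an assumption, neither is part of the DA-RNN scripts.

```shell
# Hypothetical guard (not in the DA-RNN scripts): use the first argument as
# the GPU ID and fall back to 0 when none is given, instead of exporting an
# empty CUDA_VISIBLE_DEVICES.
select_gpu() {
    gpu_id="${1:-0}"
    export CUDA_VISIBLE_DEVICES="$gpu_id"
    echo "Using CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
}

# Called without an argument, the helper picks device 0.
select_gpu
```

Note that once CUDA_VISIBLE_DEVICES restricts the process to one device, that device is index 0 inside the process, so the --gpu 0 flag in the script refers to whichever GPU was selected here.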

kevinkit avatar Aug 15 '17 08:08 kevinkit

maybe try running it with sudo

kevinkit avatar Aug 15 '17 12:08 kevinkit

The testing code calls the C++ KinectFusion library from Python. This step is not stable. I also encountered crashes, due to some malloc issue inside Python. You can debug by running "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and then use backtrace to see the problem.
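
For a non-interactive variant of the same debugging step, gdb can run the program and print the backtrace in one go. This is only a sketch: `run_under_gdb` is a hypothetical wrapper name, and the example invocation is left commented out because it assumes you are in the DA-RNN root with the model and data in place.

```shell
# Hypothetical wrapper: run any command under gdb in batch mode.
# -batch makes gdb exit after executing the -ex commands in order:
# "run" starts the program, "bt" prints the backtrace once it stops.
run_under_gdb() {
    gdb -batch -ex run -ex bt --args "$@"
}

# Example, to be run from the DA-RNN root (paths taken from this thread):
# run_under_gdb python ./tools/test_net.py --gpu 0 --network vgg16 \
#     --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt \
#     --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml \
#     --rig data/RGBDScene/camera.json 2>&1 | tee gdb_backtrace.txt
```

Saving the output with tee gives a complete backtrace that can be pasted into a report without copying from the terminal.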

yuxng avatar Aug 15 '17 20:08 yuxng

I ran this and there was no problem, but when I added --kfusion 1 at the end, I encountered this error:

[New Thread 0x7ffe65ffb700 (LWP 8553)]
[New Thread 0x7ffe667fc700 (LWP 8554)]
[New Thread 0x7ffe67fff700 (LWP 8555)]
[New Thread 0x7ffe677fe700 (LWP 8556)]
[New Thread 0x7ffe66ffd700 (LWP 8557)]
[New Thread 0x7ffe5e22a700 (LWP 8558)]
[New Thread 0x7ffe5da29700 (LWP 8559)]
[New Thread 0x7ffe5d228700 (LWP 8560)]
[New Thread 0x7ffe5ca27700 (LWP 8561)]
[New Thread 0x7ffe4ffff700 (LWP 8562)]
[New Thread 0x7ffe4f7fe700 (LWP 8563)]
[New Thread 0x7ffe4effd700 (LWP 8564)]
[New Thread 0x7ffe4e7fc700 (LWP 8565)]
[New Thread 0x7ffe4dffb700 (LWP 8566)]
[New Thread 0x7ffe4d7fa700 (LWP 8567)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:245
245 ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
(gdb) quit
A debugging session is active.

Inferior 1 [process 7937] will be killed.

AliBuildsAI avatar Aug 16 '17 00:08 AliBuildsAI

@kevinkit the same happens when I add 0 at the end of the command.

When I ran it with sudo, this error happens:

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
+ exec
++ tee -a experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
+ echo Logging output to experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
Logging output to experiments/logs/rgbd_scene_multi_rgbd_test.txt.2017-08-15_17-08-27
+ '[' -f /home/aliman/DA-RNN-master/output/rgbd_scene/rgbd_scene_val/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000/segmentations.pkl ']'
+ ./tools/test_net.py --gpu 0 --network vgg16 --model data/fcn_models/rgbd_scene/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json --kfusion 1
Traceback (most recent call last):
  File "./tools/test_net.py", line 13, in <module>
    from fcn.test import test_net
  File "/home/aliman/DA-RNN-master/tools/../lib/fcn/test.py", line 25, in <module>
    from kinect_fusion import kfusion
ImportError: libkfusion.so: cannot open shared object file: No such file or directory

(But I have libkfusion.so in DA-RNN/lib/kinect_fusion/build directory)

AliBuildsAI avatar Aug 16 '17 00:08 AliBuildsAI

Have you solved the problem? I have run into the same situation and I don't know how to work it out.

doomxhc avatar Sep 13 '17 08:09 doomxhc

As @yuxng mentioned earlier, you can try to backtrace the problem with the gdb debugger by running:

gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json

and then using backtrace to see the problem.

kevinkit avatar Sep 13 '17 08:09 kevinkit

@AliManUtd1993, did you compile DA-RNN successfully? I always run into an error in Kinect_Fusion.

lizhihuit avatar Nov 02 '17 14:11 lizhihuit

I compiled all parts except the kinect_fusion part.

AliBuildsAI avatar Nov 02 '17 23:11 AliBuildsAI

@AliManUtd1993, have you compiled DA-RNN successfully now? When I run test_kinect_fusion.sh, it always shows:

ImportError: libkfusion.so: cannot open shared object file: No such file or directory

But libkfusion.so is in lib/kinect_fusion/build, and everything else runs successfully.
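
An ImportError of this kind means the dynamic loader could not find the library at import time, even though the file exists on disk. One way to confirm that is to ask ldd which dependencies of the compiled module are unresolved. The sketch below is an illustration only: `check_shared_libs` is a hypothetical helper, and which compiled file to pass to it is left to the reader, since the exact name of the Python extension is not given in this thread.

```shell
# Hypothetical helper: report shared-library dependencies that the dynamic
# loader cannot resolve for a given compiled module or library.
check_shared_libs() {
    # $1: path to a compiled extension or shared library
    if [ ! -f "$1" ]; then
        echo "no such file: $1"
        return 1
    fi
    # ldd marks every dependency it cannot locate with "not found"
    if ldd "$1" | grep "not found"; then
        return 1
    fi
    echo "all shared library dependencies of $1 were resolved"
}

# Example (the module path is an assumption; list candidates first):
# find lib/kinect_fusion -name '*.so'
# check_shared_libs path/to/the/compiled/module.so
```

If libkfusion.so shows up as "not found", the loader search path does not include the directory that contains it.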

baolinv0 avatar Dec 21 '17 08:12 baolinv0

No, I did not try any further.

AliBuildsAI avatar Dec 21 '17 08:12 AliBuildsAI

Thank you for your quick reply.

baolinv0 avatar Dec 21 '17 09:12 baolinv0

@yuxng @kevinkit I hit the same problem, and I found that the error happens at kinect_fusion.cpp => create_tensors() => initMarchingCubesTables();

I ran the suggested command "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json" and took a backtrace. It shows:

#6 0x00007ffff7814f45 in __libc_start_main (main=0x466e50 <main>, argc=14, argv=0x7fffffffdc98, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdc88) at libc-start.c:287
#7 0x0000000000577c2e in _start ()

baolinv0 avatar Dec 21 '17 14:12 baolinv0

Hi, @beginnerFighting

Have you solved the problem "ImportError: libkfusion.so: cannot open shared object file: No such file or directory" ?

Thanks for your reply!

Ramay7 avatar Mar 30 '18 04:03 Ramay7

Hi, @Ramay7, I got the same error as you did. I am wondering whether you have solved the issue or have any suggestions. Thanks for your help!

Wei2624 avatar Jun 17 '18 16:06 Wei2624

Hi, @Wei2624. I gave up on this project and did not find any solution, sorry... :(

Ramay7 avatar Jun 18 '18 14:06 Ramay7

> Hi, @beginnerFighting
>
> Have you solved the problem "ImportError: libkfusion.so: cannot open shared object file: No such file or directory" ?
>
> Thanks for your reply!

I think you forgot this step: add the KinectFusion library path in the shell.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ROOT/lib/kinect_fusion/build

I have to execute this step every time I start the computer; otherwise you will hit that error.
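
A slightly more defensive version of that step is sketched below. It assumes ROOT points at the DA-RNN checkout (defaulting to the current directory, which is an assumption), avoids adding the same directory twice, and reports whether libkfusion.so is actually present there. The variable name KFUSION_BUILD is made up for this example.

```shell
# Assumption: ROOT is the DA-RNN checkout; default to the current directory.
ROOT="${ROOT:-$PWD}"
KFUSION_BUILD="$ROOT/lib/kinect_fusion/build"

# Append the build directory to LD_LIBRARY_PATH only if it is not there yet.
case ":${LD_LIBRARY_PATH:-}:" in
    *":$KFUSION_BUILD:"*) ;;
    *) export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$KFUSION_BUILD" ;;
esac

# Report whether the library the importer is looking for exists there.
if [ -f "$KFUSION_BUILD/libkfusion.so" ]; then
    echo "libkfusion.so found in $KFUSION_BUILD"
else
    echo "libkfusion.so not found in $KFUSION_BUILD (build kinect_fusion first?)"
fi
```

To avoid repeating this after every reboot, the export line can be appended to a shell startup file such as ~/.bashrc, with ROOT replaced by the absolute path of the checkout. Also note that sudo usually resets the environment, so LD_LIBRARY_PATH may not reach a command started with sudo; that may be why the sudo run earlier in this thread failed with the same ImportError.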

gaochuan2017 avatar Jan 11 '19 08:01 gaochuan2017

> The testing code calls the c++ KinectFusion library in Python. This step is not stable. I also encountered crashes, due to some malloc issue inside python. You can debug by running "gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model output/rgbd_scene/rgbd_scene_train/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json", and backtrace to see the problem.

@yuxng I would like to know how you addressed the malloc issue you mentioned. It seems that I hit the same error as you. I tested the trained model with this command:

sudo gdb --args python ./tools/test_net.py --gpu 0 --network vgg16 --model data/fcn_models/rgbd_scene/vgg16_fcn_rgbd_multi_frame_rgbd_scene_iter_40000.ckpt --imdb rgbd_scene_val --cfg experiments/cfgs/rgbd_scene_multi_rgbd.yml --rig data/RGBDScene/camera.json --kfusion 1

and got this error in gdb:

(gdb) bt
#0 malloc_consolidate (av=av@entry=0x7ffff7bb4b20 <main_arena>) at malloc.c:4181
#1 0x00007ffff7871cde in _int_malloc (av=av@entry=0x7ffff7bb4b20 <main_arena>, bytes=bytes@entry=1024) at malloc.c:3450
#2 0x00007ffff7874184 in __GI___libc_malloc (bytes=1024) at malloc.c:2913
#3 0x00007fff973b7685 in __pyx_insert_code_object (code_object=0x7fff7e7c28b0, code_line=1390) at kinect_fusion/kfusion.cpp:6647
#4 __Pyx_AddTraceback (funcname=funcname@entry=0x7fff973c34c0 "kinect_fusion.kfusion.PyKinectFusion.__cinit__", c_line=c_line@entry=1390, py_line=py_line@entry=32, filename=filename@entry=0x7fff973c2362 "kinect_fusion/kfusion.pyx") at kinect_fusion/kfusion.cpp:6750
#5 0x00007fff973b9931 in __pyx_pf_13kinect_fusion_7kfusion_14PyKinectFusion___cinit__ (__pyx_v_self=0x7fff9d997c48, __pyx_v_rig_file="") at kinect_fusion/kfusion.cpp:1406
#6 __pyx_pw_13kinect_fusion_7kfusion_14PyKinectFusion_1__cinit__ (__pyx_kwds=<optimized out>, __pyx_args=<optimized out>, __pyx_v_self=0x7fff9d997c48) at kinect_fusion/kfusion.cpp:1363
#7 __pyx_tp_new_13kinect_fusion_7kfusion_PyKinectFusion (t=<optimized out>, a=<optimized out>, k=<optimized out>) at kinect_fusion/kfusion.cpp:5068
#8 0x00000000004aaa15 in ?? ()
#9 0x00000000004c166d in PyEval_EvalFrameEx ()
#10 0x00000000004c141f in PyEval_EvalFrameEx ()
#11 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#12 0x00000000004eb69f in ?? ()
#13 0x00000000004e58f2 in PyRun_FileExFlags ()
#14 0x00000000004e41a6 in PyRun_SimpleFileExFlags ()
#15 0x00000000004938ce in Py_Main ()
#16 0x00007ffff7810830 in __libc_start_main (main=0x493370 <main>, argc=16, argv=0x7fffffffe418, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe408) at ../csu/libc-start.c:291
#17 0x0000000000493299 in _start ()

gaochuan2017 avatar Jan 12 '19 13:01 gaochuan2017