pointnet2

Error when training

Open pauloffsf opened this issue 5 years ago • 15 comments

Hello!

I searched the existing issues to see whether anyone had hit the same error, but couldn't find anything.

I am trying to run the semantic segmentation training code on my own dataset.

I followed the README, installed all the dependencies, and compiled all the tf_ops shared objects (.so files), but when I run the training I get an undefined symbol error from tf_sampling_so.so (I've attached an image with the error):

[Image: IMG_20190312_145653, photo of the error message]

Has anyone seen this kind of problem, and do you know how to solve it? I'm using CUDA 9.0 and TF 1.13.

Thanks!

pauloffsf avatar Mar 12 '19 19:03 pauloffsf

I have hit a similar problem: tf_sampling_so.so: undefined symbol: _ZN10tensorflow8internal21CheckOpMessageBuilder9NewStringEv. I'm using CUDA 9.0 and TF 1.12.
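To see which library the missing symbol belongs to, it can help to decode the mangled name. The following is only a minimal sketch of a decoder for the simplest Itanium C++ ABI names (length-prefixed identifiers inside `N...E`, plus the `_ZTV` vtable prefix); the function name `decode_nested_name` is made up for this illustration, and a real tool such as `c++filt` handles far more cases.

```python
import re


def decode_nested_name(symbol):
    """Decode a simple Itanium-ABI nested name such as _ZN3foo3barEv.

    Only length-prefixed identifiers inside N...E and the _ZTV (vtable)
    prefix are handled. Templates, substitutions, ABI tags and parameter
    types are deliberately ignored in this sketch.
    """
    prefix = ""
    if symbol.startswith("_ZTV"):
        prefix, rest = "vtable for ", symbol[4:]
    elif symbol.startswith("_Z"):
        rest = symbol[2:]
    else:
        raise ValueError("not an Itanium-mangled symbol: " + symbol)
    if not rest.startswith("N"):
        raise ValueError("only nested names (N...E) are supported")

    pos, parts = 1, []
    while pos < len(rest) and rest[pos] != "E":
        match = re.match(r"\d+", rest[pos:])
        if match is None:
            raise ValueError("unsupported mangling construct in " + symbol)
        length = int(match.group())
        start = pos + len(match.group())
        parts.append(rest[start:start + length])
        pos = start + length
    return prefix + "::".join(parts)


if __name__ == "__main__":
    print(decode_nested_name(
        "_ZN10tensorflow8internal21CheckOpMessageBuilder9NewStringEv"))
    # prints: tensorflow::internal::CheckOpMessageBuilder::NewString
```

The decoded name lives in the `tensorflow` namespace, which suggests the custom op cannot resolve a symbol it expects from TensorFlow's own shared library, rather than from CUDA or from the op's own code.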

Kevinlongran avatar Mar 14 '19 03:03 Kevinlongran

I thought that maybe I was using a mismatched TF version.

Kevinlongran avatar Mar 14 '19 03:03 Kevinlongran

Maybe you can find the solution by following https://github.com/charlesq34/pointnet2/issues/48

zhangxing1995 avatar Mar 15 '19 05:03 zhangxing1995

Hello, I'm using TF-gpu 1.13.0 and CUDA 10.0 on Ubuntu 18.04 and I still have this problem when I run train.py. Is this problem caused by the TF and CUDA versions? PS: I couldn't use CUDA 9.0 on my Ubuntu version because of the g++ version (> 6.0).

SalaheddineSTA avatar May 10 '19 14:05 SalaheddineSTA

Hello, I am facing a similar issue: tensorflow.python.framework.errors_impl.NotFoundError: /mnt/disks/user/project/pointnet2/tf_ops/sampling/tf_sampling_so.so: undefined symbol: _ZTVN10tensorflow14kernel_factory17OpKernelRegistrar18PtrOpKernelFactoryE. I am using gcc > 6. I tried the fix from #48 mentioned above, but I still get the same error.

kiranintellify avatar Jul 31 '19 13:07 kiranintellify

@pauloffsf @Kevinlongran @zhangxing1995 @SalaheddineSTA @kiranintellify have you solved the problem? I ran into the same problem. I use CUDA 9.0 and TF 1.12.

MrCrazyCrab avatar Nov 19 '19 09:11 MrCrazyCrab

I was able to solve it a while back, but I lost my tutorial with all the steps I took. It mostly came down to the way you compile the C/C++ code.

pauloffsf avatar Nov 19 '19 10:11 pauloffsf

@pauloffsf I can compile the code successfully, but something goes wrong when I load the .so files. If it's OK, could you share your .so files with me?

MrCrazyCrab avatar Nov 20 '19 01:11 MrCrazyCrab

It is related to the compile options you use. I don't have the .so files anymore; they were stored with the tutorial I created. The computer's hard drive had a problem and the machine had to be formatted. After that, I was no longer able to work with the code.

I also used this https://github.com/pubgeo/dfc2019/tree/master/track4/pointnet2 to help me compile.
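Since the compile options are the suspected cause, here is a rough sketch of how the g++ command for one op could be assembled from TensorFlow's own flags. Everything in it is an assumption made for illustration: the helper names, the default `-std=c++11`, and the example paths are hypothetical. In practice the two flag lists would typically come from `tf.sysconfig.get_compile_flags()` and `tf.sysconfig.get_link_flags()` in TF versions that provide them, and the CUDA include and library options would be passed through `extra_flags`.

```python
def build_op_link_command(sources, output, compile_flags, link_flags,
                          extra_flags=(), std="c++11"):
    """Return a g++ argument list that builds a shared object for one op.

    compile_flags and link_flags are passed through unchanged, so the ABI
    define and the -ltensorflow_framework option stay exactly as given.
    """
    return (["g++", "-std=" + std, "-shared", "-fPIC"]
            + list(sources) + ["-o", output]
            + list(compile_flags) + list(extra_flags)
            + list(link_flags) + ["-O2"])


def cxx11_abi_value(compile_flags):
    """Return the value of -D_GLIBCXX_USE_CXX11_ABI if present, else None."""
    for flag in compile_flags:
        if flag.startswith("-D_GLIBCXX_USE_CXX11_ABI="):
            return flag.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # Hypothetical flag values; real ones depend on the installed TensorFlow.
    cflags = ["-I/path/to/tensorflow/include", "-D_GLIBCXX_USE_CXX11_ABI=0"]
    lflags = ["-L/path/to/tensorflow", "-ltensorflow_framework"]
    command = build_op_link_command(
        ["tf_sampling.cpp", "tf_sampling_g.cu.o"], "tf_sampling_so.so",
        cflags, lflags)
    print(" ".join(command))
    print("ABI setting reported by the flags:", cxx11_abi_value(cflags))
```

Taking both flag lists from the same TensorFlow installation that will load the op is the point of the sketch: it keeps the ABI define and the linked framework library consistent with that installation.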

I am going to work with this code again two weeks from now. If I figure out how to solve it again, I'll let you know.

pauloffsf avatar Nov 20 '19 02:11 pauloffsf

@MrCrazyCrab sorry it took me so long to answer. Were you able to solve your problem?

I was able to solve it like this:

Besides the "-D_GLIBCXX_USE_CXX11_ABI=0" parameter of g++, I fixed my problem with this:

You need to check whether -ltensorflow_framework was linked properly into your tf_ops *.so files. To do that, run:

$ ldd tf_grouping_so.so (for example)

Check whether libtensorflow_framework.so is in the list. If it isn't, you haven't linked it properly (see 1). If it is, see 2.

  1. This may happen if your library has a version suffix, something like *.so.x, where x is the version number. If that is the case, you need to create a symbolic link from *.so to *.so.x:

$ sudo ln -s libtensorflow_framework.so.x libtensorflow_framework.so

You then have to compile every tf_op again and check with ldd again.

  2. If it is in the list but shows as not found, you just need to add its path to the library path:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:(path to where libtensorflow_framework.so is)
$ sudo ldconfig
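The symbolic link step can also be sketched in Python. This is only an illustration under stated assumptions: the helper name `ensure_unversioned_link` is made up, it picks the first versioned file in sorted order, and it needs write permission in the library directory (the `ln -s` command in the thread uses sudo for that reason).

```python
import glob
import os
import tempfile


def ensure_unversioned_link(lib_dir, name="libtensorflow_framework.so"):
    """Create lib_dir/name as a symlink to a versioned name.* file.

    Returns the path of the unversioned name, or None when neither the
    unversioned file nor any versioned candidate exists in lib_dir.
    """
    target = os.path.join(lib_dir, name)
    if os.path.lexists(target):
        return target  # already present (file or link): leave it alone
    candidates = sorted(glob.glob(target + ".*"))
    if not candidates:
        return None
    # Relative link, like `ln -s libfoo.so.x libfoo.so` run inside lib_dir.
    os.symlink(os.path.basename(candidates[0]), target)
    return target


if __name__ == "__main__":
    # Demonstration in a throwaway directory with a fake versioned library.
    with tempfile.TemporaryDirectory() as demo_dir:
        fake = os.path.join(demo_dir, "libtensorflow_framework.so.1")
        open(fake, "w").close()
        link = ensure_unversioned_link(demo_dir)
        print(link, "->", os.readlink(link))
```

As the thread notes, the ops have to be recompiled after the link is created so that the linker can actually pick up the library.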

That's it.

You can check again with ldd to see whether the library is in the list and properly found, and then run your train.py.

I was able to solve it in an Anaconda Python 3.7 environment, with g++ 7.5 and TensorFlow 2.1.
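The ldd check described in this comment can be automated with a small script. The sketch below assumes the usual ldd line layout (`name => path (address)` or `name => not found`); the function names and the three status strings are invented for this example, and `run_ldd` requires Python 3.7 or later and a system where `ldd` is available.

```python
import subprocess


def framework_link_status(ldd_output, name="libtensorflow_framework"):
    """Classify how the framework library appears in ldd output.

    Returns "not linked" when no line mentions the library (case 1 in the
    thread), "not found" when it is listed but unresolved (case 2), and
    "resolved" when ldd shows a path for it.
    """
    for line in ldd_output.splitlines():
        entry = line.strip()
        if not entry.startswith(name):
            continue
        if "not found" in entry:
            return "not found"
        return "resolved"
    return "not linked"


def run_ldd(shared_object):
    """Run ldd on a shared object and return its text output."""
    result = subprocess.run(["ldd", shared_object], capture_output=True,
                            text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    sample = (
        "\tlinux-vdso.so.1 (0x00007ffd0a5f2000)\n"
        "\tlibtensorflow_framework.so => not found\n"
        "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0d2a000000)\n"
    )
    print(framework_link_status(sample))  # prints: not found
```

For a real file, `framework_link_status(run_ldd("tf_grouping_so.so"))` would give the same three-way answer for the op library named in the thread.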

pauloffsf avatar Apr 02 '20 23:04 pauloffsf

@pauloffsf Thanks all the same, I have solved the problem. I did it by setting up the right virtual environment. However, I couldn't get it working with TensorFlow 1.14 because of the *.so.x file, and your suggestion would be a solution to that.

MrCrazyCrab avatar Apr 03 '20 00:04 MrCrazyCrab

@pauloffsf Hello, I ran into a problem when following your method: running ldd tf_distance_so.so shows no libtensorflow_framework.so in the output list, and I could not find anything like libtensorflow_framework.so.x in the library directory either. I hope you can take the time to see what the problem is. Thank you! My environment is as follows: Ubuntu 16.04; ubuntu-drivers 440.64; CUDA 10.0; cuDNN 7.5.0; TF 1.10.0; gcc/g++ 5.4. [image] Thanks in advance!

skq-5233 avatar Jun 20 '20 09:06 skq-5233

My makefile is as follows: [image]

skq-5233 avatar Jun 20 '20 09:06 skq-5233

Have you checked whether libtensorflow is in the folder /home/user/anaconda2/envs/planenet/lib/python2.7(...)/tensorflow? That's where it should be, and usually it is named *.so.x, where x is a version number.

What tensorflow version have you installed?
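The folder check suggested above can be done with a short search instead of browsing by hand. This is a hedged sketch: `find_framework_libs` is a hypothetical helper, and the root directory to search is left to the reader. One way to obtain it, assuming TensorFlow imports correctly, is `tf.sysconfig.get_lib()`.

```python
import os
import tempfile


def find_framework_libs(root, stem="libtensorflow_framework.so"):
    """Return sorted paths under root named stem or stem.<version>."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename == stem or filename.startswith(stem + "."):
                hits.append(os.path.join(dirpath, filename))
    return sorted(hits)


if __name__ == "__main__":
    # Demonstration on a throwaway tree that mimics a package directory.
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, "site-packages", "tensorflow")
        os.makedirs(pkg)
        open(os.path.join(pkg, "libtensorflow_framework.so.1"), "w").close()
        for path in find_framework_libs(root):
            print(path)
```

An empty result for the environment's package directory would suggest that the ops are being built against a different Python environment than the one TensorFlow is installed in.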

pauloffsf avatar Jun 20 '20 10:06 pauloffsf

This is libtensorflow's path:

skq-5233 avatar Jun 20 '20 13:06 skq-5233