3D-BoNet

Problem training on my own dataset

Open lifeiwen opened this issue 3 years ago • 17 comments

Hello Dr. Yang, I am trying to train your network on my own dataset. My environment is tensorflow1.15, cuda10, cudnn7.0.4. The code compiles and the data loads normally, but training is interrupted after 30-odd epochs. The training log is below; I hope you can help.

epoch 32 end time is : 2021-01-08 14:21:29.075227
train files shuffled!
is training ep : 33
total train batch num: 100
ep 33 i 0 psemce 0.0 bbvert -0.23279694 l2 0.060497742 ce 0.26028115 siou -0.5535758 bbscore 0.0038582273 pmask 0.6467816
ep 33 i 0 test psem 0.0 bbvert 1.9851223 l2 0.059121773 ce 2.2534707 siou -0.3274702 bbscore 0.0048341155 pmask 3.4519947
test pred bborder [[2 1 0]]
ep 33 i 20 psemce 0.0 bbvert -0.44303733 l2 0.030050844 ce 0.16678412 siou -0.6398723 bbscore 0.0026201883 pmask 0.63400114
ep 33 i 20 test psem 0.0 bbvert -0.4450612 l2 0.04898257 ce 0.23810822 siou -0.732152 bbscore 0.0016184862 pmask 0.47476012
test pred bborder [[2 0 1]]
ep 33 i 40 psemce 0.0 bbvert 0.43996847 l2 0.08725857 ce 0.8109299 siou -0.45822 bbscore 0.0034747643 pmask 0.9270937
ep 33 i 40 test psem 0.0 bbvert -0.07955924 l2 0.040802542 ce 0.4581066 siou -0.5784684 bbscore 0.0087565 pmask 1.0110209
test pred bborder [[2 0 1]]
ep 33 i 60 psemce 0.0 bbvert 0.057684183 l2 0.071036406 ce 0.5709717 siou -0.58432394 bbscore 0.00046247765 pmask 0.48829234
ep 33 i 60 test psem 0.0 bbvert -0.3817188 l2 0.03684793 ce 0.2770622 siou -0.69562894 bbscore 0.0019779946 pmask 0.54431504
test pred bborder [[2 0 1]]
ep 33 i 80 psemce 0.0 bbvert 0.038050413 l2 0.034902867 ce 0.6071266 siou -0.60397905 bbscore 0.015431552 pmask 0.8978894
ep 33 i 80 test psem 0.0 bbvert 1.1076844 l2 0.07785928 ce 1.3761435 siou -0.34631833 bbscore 0.033992507 pmask 1.7343999
test pred bborder [[2 0 1]]
model saved in : ./log/train_mod/model033.cptk
epoch 33 end time is : 2021-01-08 14:21:44.245053
train files shuffled!
is training ep : 34
total train batch num: 100
ep 34 i 0 psemce 0.0 bbvert -0.41581324 l2 0.057975773 ce 0.28829214 siou -0.76208115 bbscore 0.0003172583 pmask 0.34254307
ep 34 i 0 test psem 0.0 bbvert 1.7912706 l2 0.08668331 ce 2.017744 siou -0.31315675 bbscore 0.00576146 pmask 2.1253805
test pred bborder [[0 2 1]]
ep 34 i 20 psemce 0.0 bbvert -0.14073128 l2 0.034625944 ce 0.4937689 siou -0.6691261 bbscore 0.0047056335 pmask 0.7088615
ep 34 i 20 test psem 0.0 bbvert 1.9534252 l2 0.0907397 ce 2.1705353 siou -0.3078499 bbscore 0.0019757028 pmask 2.682012
test pred bborder [[0 2 1]]
ep 34 i 40 psemce 0.0 bbvert 0.27091432 l2 0.053299602 ce 0.7194822 siou -0.5018675 bbscore 0.001409175 pmask 0.60204554
ep 34 i 40 test psem 0.0 bbvert -0.28416353 l2 0.09286666 ce 0.19349718 siou -0.5705274 bbscore 0.003192804 pmask 0.21589296
test pred bborder [[1 0 2]]
2021-01-08 14:21:52.432564: W tensorflow/core/framework/op_kernel.cc:1639] Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call ret = func(*args)

File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only row_ind, col_ind = linear_sum_assignment(valid_cost)

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment raise ValueError("matrix contains invalid numeric entries")

ValueError: matrix contains invalid numeric entries

Traceback (most recent call last):
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call ret = func(*args)

File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only row_ind, col_ind = linear_sum_assignment(valid_cost)

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment raise ValueError("matrix contains invalid numeric entries")

ValueError: matrix contains invalid numeric entries

 [[{{node bbox/PyFunc}}]]
 [[gradients/backbone/fa_layer1/ThreeInterpolate_grad/ThreeInterpolateGrad/_425]]

(1) Invalid argument: ValueError: matrix contains invalid numeric entries Traceback (most recent call last):

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call ret = func(*args)

File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only row_ind, col_ind = linear_sum_assignment(valid_cost)

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment raise ValueError("matrix contains invalid numeric entries")

ValueError: matrix contains invalid numeric entries

lifeiwen avatar Jan 08 '21 08:01 lifeiwen

Hi @lifeiwen, it seems that some value(s) in the cost matrix are NaN. This numerical issue may arise when computing the costs (siou or ce).
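
To make the diagnosis above concrete, here is a minimal sketch of how one might check a cost matrix for invalid entries before it reaches `linear_sum_assignment`. It is not code from the 3D-BoNet repository: the helper names `find_invalid_cost_entries` and `sanitize_cost`, the `penalty` value, and the example matrix are all hypothetical, and only NumPy is assumed.

```python
import numpy as np


def find_invalid_cost_entries(cost):
    """Return (row, col) indices of NaN or infinite entries in a 2-D cost matrix."""
    cost = np.asarray(cost, dtype=np.float64)
    bad = ~np.isfinite(cost)
    return [(int(r), int(c)) for r, c in np.argwhere(bad)]


def sanitize_cost(cost, penalty=1e6):
    """Replace non-finite entries with a large finite penalty.

    The penalty is an arbitrary illustrative choice; it only keeps the
    assignment step from raising and does not fix the underlying cause.
    """
    cost = np.asarray(cost, dtype=np.float64)
    return np.where(np.isfinite(cost), cost, penalty)


# Hypothetical cost matrix with one NaN and one infinite entry.
valid_cost = np.array([[0.2, np.nan],
                       [np.inf, 0.1]])

print(find_invalid_cost_entries(valid_cost))
print(sanitize_cost(valid_cost))
```

Logging the offending indices is probably more useful than sanitizing, since it shows which cost term (siou or ce) first produced the invalid value.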

Yang7879 avatar Mar 02 '21 17:03 Yang7879

Hi @Yang7879

Same error here; I am using the S3DIS dataset.

What is the solution to this?

piseabhijeet avatar Mar 31 '21 06:03 piseabhijeet

@Yang7879 I also used S3DIS to train the model, but I got the same error.

lifeiwen avatar Mar 31 '21 06:03 lifeiwen

Hi @lifeiwen

I used the solution from https://github.com/Yang7879/3D-BoNet/issues/24#issuecomment-666012822, but it still resulted in the same issue.

piseabhijeet avatar Mar 31 '21 07:03 piseabhijeet

Hi @piseabhijeet, I used TensorFlow 1.14 and that solved the problem, though I don't know why. You can try TensorFlow 1.14.

lifeiwen avatar Mar 31 '21 07:03 lifeiwen

Hi @lifeiwen

Did you use TF 1.14 with the same source code for the S3DIS dataset, without any changes?

Thanks again for your response.

piseabhijeet avatar Mar 31 '21 08:03 piseabhijeet

Hi @piseabhijeet, yes, I did not change the source code or the data. What environment are you using? I tried versions 1.15 and above and the problem persists, but with versions below 1.15 it disappears. It may be caused by one of the functions.

lifeiwen avatar Mar 31 '21 09:03 lifeiwen

Hi @lifeiwen

Thank you for your quick response.

I just tried running the code on TF 1.13 without any changes, and it is working fine so far:

image

Yes, I agree with your observation: it does not work on TF 1.15. Thanks to your input, I was able to quickly downgrade and experiment.

piseabhijeet avatar Mar 31 '21 10:03 piseabhijeet

@piseabhijeet OK. If you find the cause of the problem on TF 1.15, please tell me. Thanks.

lifeiwen avatar Mar 31 '21 12:03 lifeiwen

@lifeiwen - sure will do, thanks

piseabhijeet avatar Mar 31 '21 13:03 piseabhijeet

Hi, I also want to train on my own data. Can I ask you some questions by email? My email is [email protected].

clare19997 avatar Apr 20 '21 11:04 clare19997

@clare19997 [email protected]

lifeiwen avatar May 04 '21 12:05 lifeiwen

@clare19997 @lifeiwen Could you please share your data preprocessing code for dividing the point cloud into blocks to generate the h5 files? My email is [email protected]. Thanks in advance.

souri1234 avatar May 15 '21 02:05 souri1234

How do you make your own dataset?

QingWindIsStillTheWind avatar Nov 22 '21 01:11 QingWindIsStillTheWind

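How did you prepare your own dataset? Mine is in .ply format, captured with a depth camera. How should I convert it to the .h5 format the network requires?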

PengboLi1998 avatar Mar 27 '22 13:03 PengboLi1998

Hello, may I ask how you processed your own dataset?

zhongxiaj avatar Jul 12 '22 03:07 zhongxiaj

For those who see this error after adjusting parameters such as the learning rate (as I just did): a learning rate that is too large may be the cause, making training diverge out of control. Switching to a smaller learning rate fixed it in my case.
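
As a toy illustration of that point (not the 3D-BoNet training loop), the sketch below runs plain gradient descent on f(x) = x**2. The function `gradient_descent` and the two learning rates are made up for the example. With a small step the iterate shrinks toward zero; with a step that is too large it grows until it overflows and turns into NaN, the kind of value that would then show up as an invalid entry in a cost matrix.

```python
import math


def gradient_descent(lr, steps=2000, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # standard gradient descent update
    return x


small = gradient_descent(lr=0.1)   # each step multiplies x by 0.8
large = gradient_descent(lr=1.5)   # each step multiplies x by -2

print("lr=0.1 ->", small)
print("lr=1.5 ->", large, "is NaN:", math.isnan(large))
```

In a real training run the same symptom can be caught earlier by checking each logged loss with `math.isfinite` and stopping as soon as one is not finite.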

pingapplepen avatar Oct 07 '22 10:10 pingapplepen