FPN_Tensorflow
When training on the VOC dataset, the RPN loss does not converge. What could be the possible causes?
rpn_loc_loss is around 0.xx on some images, but over 30 on others.
Testing with the trained model also gives no correct results: every image is classified as 0, and no predicted boxes are produced.
Any help would be appreciated, thanks :(
Here is a segment of the training log, taken after lowering the learning rate to 0.0001:
2018-06-14 12:20:30: step60403 image_name:008105.jpg |
rpn_loc_loss:2.05723357201 | rpn_cla_loss:0.900643050671 | rpn_total_loss:2.95787668228 |
fast_rcnn_loc_loss:0.29182729125 | fast_rcnn_cla_loss:0.494256854057 | fast_rcnn_total_loss:0.78608417511 |
total_loss:4.47591352463 | pre_cost_time:0.469601869583s
2018-06-14 12:20:35: step60413 image_name:004556.jpg |
rpn_loc_loss:19.2806301117 | rpn_cla_loss:0.116126872599 | rpn_total_loss:19.3967571259 |
fast_rcnn_loc_loss:0.03605248034 | fast_rcnn_cla_loss:0.0864197462797 | fast_rcnn_total_loss:0.12247222662 |
total_loss:20.2511844635 | pre_cost_time:0.511494874954s
2018-06-14 12:20:40: step60423 image_name:007907.jpg |
rpn_loc_loss:29.0486793518 | rpn_cla_loss:0.16169847548 | rpn_total_loss:29.2103786469 |
fast_rcnn_loc_loss:0.0694024860859 | fast_rcnn_cla_loss:0.0653923526406 | fast_rcnn_total_loss:0.134794831276 |
total_loss:30.0771274567 | pre_cost_time:0.531568050385s
2018-06-14 12:20:45: step60433 image_name:001663.jpg |
rpn_loc_loss:13.3469562531 | rpn_cla_loss:0.133825153112 | rpn_total_loss:13.4807815552 |
fast_rcnn_loc_loss:0.0642159730196 | fast_rcnn_cla_loss:0.0958302691579 | fast_rcnn_total_loss:0.160046249628 |
total_loss:14.3727817535 | pre_cost_time:0.455081939697s
2018-06-14 12:20:50: step60443 image_name:001862.jpg |
rpn_loc_loss:0.353244125843 | rpn_cla_loss:0.0793683156371 | rpn_total_loss:0.432612448931 |
fast_rcnn_loc_loss:0.0735497996211 | fast_rcnn_cla_loss:0.066331461072 | fast_rcnn_total_loss:0.139881253242 |
total_loss:1.30444681644 | pre_cost_time:0.485862016678s
2018-06-14 12:20:55: step60453 image_name:004847.jpg |
rpn_loc_loss:15.3173036575 | rpn_cla_loss:0.270304232836 | rpn_total_loss:15.5876083374 |
fast_rcnn_loc_loss:0.0786234289408 | fast_rcnn_cla_loss:0.0949833244085 | fast_rcnn_total_loss:0.173606753349 |
total_loss:16.4931697845 | pre_cost_time:0.465280056s
2018-06-14 12:20:59: step60463 image_name:008649.jpg |
rpn_loc_loss:11.835559845 | rpn_cla_loss:0.0990793630481 | rpn_total_loss:11.9346389771 |
fast_rcnn_loc_loss:0.0224529933184 | fast_rcnn_cla_loss:0.0669757574797 | fast_rcnn_total_loss:0.0894287526608 |
total_loss:12.756023407 | pre_cost_time:0.483344078064s
2018-06-14 12:21:04: step60473 image_name:001563.jpg |
rpn_loc_loss:2.02798223495 | rpn_cla_loss:0.268615990877 | rpn_total_loss:2.29659819603 |
fast_rcnn_loc_loss:0.122499987483 | fast_rcnn_cla_loss:0.214369207621 | fast_rcnn_total_loss:0.336869180202 |
total_loss:3.36542034149 | pre_cost_time:0.507468938828s
2018-06-14 12:21:10: step60483 image_name:009480.jpg |
rpn_loc_loss:10.0660276413 | rpn_cla_loss:0.475047171116 | rpn_total_loss:10.5410747528 |
fast_rcnn_loc_loss:0.106006294489 | fast_rcnn_cla_loss:0.161696076393 | fast_rcnn_total_loss:0.267702370882 |
total_loss:11.5407314301 | pre_cost_time:0.480077028275s
2018-06-14 12:21:14: step60493 image_name:002549.jpg |
rpn_loc_loss:1.25509309769 | rpn_cla_loss:0.556017160416 | rpn_total_loss:1.8111102581 |
fast_rcnn_loc_loss:0.0191622953862 | fast_rcnn_cla_loss:0.0564195141196 | fast_rcnn_total_loss:0.0755818113685 |
total_loss:2.61864495277 | pre_cost_time:0.465098142624s
2018-06-14 12:21:19: step60503 image_name:007583.jpg |
rpn_loc_loss:0.33879968524 | rpn_cla_loss:0.222999542952 | rpn_total_loss:0.561799228191 |
fast_rcnn_loc_loss:0.0962313115597 | fast_rcnn_cla_loss:0.170783832669 | fast_rcnn_total_loss:0.26701515913 |
total_loss:1.5607676506 | pre_cost_time:0.43776011467s
2018-06-14 12:21:27: step60513 image_name:005022.jpg |
rpn_loc_loss:0.154285207391 | rpn_cla_loss:0.265934407711 | rpn_total_loss:0.420219600201 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.0173826627433 | fast_rcnn_total_loss:0.0173826627433 |
total_loss:1.16955566406 | pre_cost_time:0.501232147217s
2018-06-14 12:21:32: step60523 image_name:003551.jpg |
rpn_loc_loss:0.595891058445 | rpn_cla_loss:0.174645081162 | rpn_total_loss:0.770536124706 |
fast_rcnn_loc_loss:0.0125400628895 | fast_rcnn_cla_loss:0.0366840735078 | fast_rcnn_total_loss:0.0492241382599 |
total_loss:1.55171358585 | pre_cost_time:0.473511219025s
2018-06-14 12:21:37: step60533 image_name:002795.jpg |
rpn_loc_loss:0.415669262409 | rpn_cla_loss:0.368684262037 | rpn_total_loss:0.784353494644 |
fast_rcnn_loc_loss:0.0294939801097 | fast_rcnn_cla_loss:0.0599008537829 | fast_rcnn_total_loss:0.0893948376179 |
total_loss:1.60570168495 | pre_cost_time:0.462094068527s
2018-06-14 12:21:42: step60543 image_name:009306.jpg |
rpn_loc_loss:8.61710548401 | rpn_cla_loss:0.437390208244 | rpn_total_loss:9.05449581146 |
fast_rcnn_loc_loss:0.106448456645 | fast_rcnn_cla_loss:0.136525779963 | fast_rcnn_total_loss:0.242974236608 |
total_loss:10.0294246674 | pre_cost_time:0.480259895325s
2018-06-14 12:21:47: step60553 image_name:000403.jpg |
rpn_loc_loss:17.2208366394 | rpn_cla_loss:0.108800955117 | rpn_total_loss:17.3296375275 |
fast_rcnn_loc_loss:0.0871697515249 | fast_rcnn_cla_loss:0.105191238225 | fast_rcnn_total_loss:0.1923609972 |
total_loss:18.2539520264 | pre_cost_time:0.4732401371s
2018-06-14 12:21:52: step60563 image_name:001776.jpg |
rpn_loc_loss:5.62985134125 | rpn_cla_loss:0.637479364872 | rpn_total_loss:6.26733064651 |
fast_rcnn_loc_loss:0.072887763381 | fast_rcnn_cla_loss:0.109321445227 | fast_rcnn_total_loss:0.182209208608 |
total_loss:7.18149423599 | pre_cost_time:0.50107383728s
2018-06-14 12:21:57: step60573 image_name:006702.jpg |
rpn_loc_loss:0.899392724037 | rpn_cla_loss:0.555935382843 | rpn_total_loss:1.45532810688 |
fast_rcnn_loc_loss:0.0513089261949 | fast_rcnn_cla_loss:0.0749804228544 | fast_rcnn_total_loss:0.126289352775 |
total_loss:2.31357073784 | pre_cost_time:0.476425170898s
2018-06-14 12:22:02: step60583 image_name:004071.jpg |
rpn_loc_loss:11.5561742783 | rpn_cla_loss:0.101583331823 | rpn_total_loss:11.6577577591 |
fast_rcnn_loc_loss:0.019030706957 | fast_rcnn_cla_loss:0.0453039929271 | fast_rcnn_total_loss:0.0643346980214 |
total_loss:12.4540472031 | pre_cost_time:0.428576946259s
2018-06-14 12:22:07: step60593 image_name:004482.jpg |
rpn_loc_loss:31.5635948181 | rpn_cla_loss:0.172389492393 | rpn_total_loss:31.7359848022 |
fast_rcnn_loc_loss:0.101848945022 | fast_rcnn_cla_loss:0.0788979232311 | fast_rcnn_total_loss:0.180746868253 |
total_loss:32.64868927 | pre_cost_time:0.479720115662s
2018-06-14 12:22:12: step60603 image_name:004670.jpg |
rpn_loc_loss:0.0734175369143 | rpn_cla_loss:0.269607931376 | rpn_total_loss:0.34302547574 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.0191261749715 | fast_rcnn_total_loss:0.0191261749715 |
total_loss:1.09410512447 | pre_cost_time:0.509476900101s
2018-06-14 12:22:18: step60613 image_name:006023.jpg |
rpn_loc_loss:12.7789239883 | rpn_cla_loss:0.0952100381255 | rpn_total_loss:12.8741340637 |
fast_rcnn_loc_loss:0.0278376862407 | fast_rcnn_cla_loss:0.0558322630823 | fast_rcnn_total_loss:0.0836699455976 |
total_loss:13.6897583008 | pre_cost_time:0.487910985947s
2018-06-14 12:22:23: step60623 image_name:001005.jpg |
rpn_loc_loss:0.639028191566 | rpn_cla_loss:0.522748112679 | rpn_total_loss:1.16177630424 |
fast_rcnn_loc_loss:0.000708404520992 | fast_rcnn_cla_loss:0.0266169980168 | fast_rcnn_total_loss:0.0273254029453 |
total_loss:1.92105507851 | pre_cost_time:0.486565113068s
2018-06-14 12:22:28: step60633 image_name:009335.jpg |
rpn_loc_loss:0.282863616943 | rpn_cla_loss:0.551841139793 | rpn_total_loss:0.834704756737 |
fast_rcnn_loc_loss:0.0 | fast_rcnn_cla_loss:0.0135724116117 | fast_rcnn_total_loss:0.0135724116117 |
total_loss:1.58023047447 | pre_cost_time:0.482041120529s
2018-06-14 12:22:34: step60643 image_name:001650.jpg |
rpn_loc_loss:1.28183162212 | rpn_cla_loss:0.373366981745 | rpn_total_loss:1.65519857407 |
fast_rcnn_loc_loss:0.0110192643479 | fast_rcnn_cla_loss:0.0356653928757 | fast_rcnn_total_loss:0.0466846562922 |
total_loss:2.43383646011 | pre_cost_time:0.508905887604s
2018-06-14 12:22:39: step60653 image_name:003087.jpg |
rpn_loc_loss:0.322137057781 | rpn_cla_loss:0.134973421693 | rpn_total_loss:0.457110464573 |
fast_rcnn_loc_loss:0.0702097117901 | fast_rcnn_cla_loss:0.176795721054 | fast_rcnn_total_loss:0.247005432844 |
total_loss:1.43606948853 | pre_cost_time:0.467333078384s
2018-06-14 12:22:44: step60663 image_name:004298.jpg |
rpn_loc_loss:4.05768680573 | rpn_cla_loss:0.608667492867 | rpn_total_loss:4.66635417938 |
fast_rcnn_loc_loss:0.0645890682936 | fast_rcnn_cla_loss:0.1116142869 | fast_rcnn_total_loss:0.176203355193 |
total_loss:5.57451200485 | pre_cost_time:0.475219964981s
2018-06-14 12:22:49: step60673 image_name:005889.jpg |
rpn_loc_loss:6.49503231049 | rpn_cla_loss:0.119138218462 | rpn_total_loss:6.6141705513 |
fast_rcnn_loc_loss:0.162173375487 | fast_rcnn_cla_loss:0.170419067144 | fast_rcnn_total_loss:0.332592427731 |
total_loss:7.67871713638 | pre_cost_time:0.486320972443s
2018-06-14 12:22:53: step60683 image_name:004984.jpg |
rpn_loc_loss:37.5337257385 | rpn_cla_loss:0.104520298541 | rpn_total_loss:37.6382446289 |
fast_rcnn_loc_loss:0.0563882887363 | fast_rcnn_cla_loss:0.0821540281177 | fast_rcnn_total_loss:0.138542324305 |
total_loss:38.5087394714 | pre_cost_time:0.47732591629s
2018-06-14 12:22:58: step60693 image_name:003655.jpg |
rpn_loc_loss:17.961019516 | rpn_cla_loss:0.250774890184 | rpn_total_loss:18.2117938995 |
fast_rcnn_loc_loss:0.0939721092582 | fast_rcnn_cla_loss:0.0833453536034 | fast_rcnn_total_loss:0.177317470312 |
total_loss:19.1210651398 | pre_cost_time:0.470115184784s
2018-06-14 12:23:04: step60703 image_name:008504.jpg |
rpn_loc_loss:0.65000295639 | rpn_cla_loss:0.365115880966 | rpn_total_loss:1.01511883736 |
fast_rcnn_loc_loss:0.0546052791178 | fast_rcnn_cla_loss:0.130576968193 | fast_rcnn_total_loss:0.185182243586 |
total_loss:1.93225443363 | pre_cost_time:0.505920886993s
@yangxue0827 Hoping you can reply, thanks.
The first 50,000 training iterations all used the default parameters; only dataset_name was changed, and the backbone network is res101.
@yangxue0827 @f27ny105t5123 How did you train on the VOC2012 dataset? I tried changing dataset_name to person and adjusting the labels so that person is 1 and all other classes are 0, but I keep getting this error:
InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values [[Node: train_op/CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/device:GPU:0"]] [[Node: Momentum/update_resnet_v1_101/block3/unit_17/bottleneck_v1/conv3/BatchNorm/beta/ApplyMomentum/_2540 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13388_Momentum/update_resnet_v1_101/block3/unit_17/bottleneck_v1/conv3/BatchNorm/beta/ApplyMomentum", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
The parameters have been changed to:
NET_NAME = 'resnet_v1_101'
VERSION = 'v2_airplane'
CLASS_NUM = 1
BASE_ANCHOR_SIZE_LIST = [15, 25, 40, 60, 80]
LEVEL = ['P2', 'P3', 'P4', 'P5', "P6"]
STRIDE = [4, 8, 16, 32, 64]
ANCHOR_SCALES = [2 ** -2, 2 ** -1, 1]
ANCHOR_RATIOS = [1, 0.5, 2, 1 / 3., 3., 1.5, 1 / 1.5]
SCALE_FACTORS = [10., 10., 5., 5.]
OUTPUT_STRIDE = 16
SHORT_SIDE_LEN = 600
DATASET_NAME = 'person'
BATCH_SIZE = 1
WEIGHT_DECAY = {'resnet_v1_50': 0.0001, 'resnet_v1_101': 0.0001}
EPSILON = 1e-5
MOMENTUM = 0.9
MAX_ITERATION = 50000
GPU_GROUP = "1"
LR = 0.00005
Everything else is at its default. Why does this still happen? I hope you can give me some pointers so I can get single-class training and testing running on VOC. Thanks~
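A "LossTensor is inf or nan" error is very often caused by degenerate or out-of-bounds ground-truth boxes rather than by the config values themselves, so it is worth scanning the annotations before blaming the hyperparameters. Below is a minimal sketch (not code from this repo) that checks Pascal VOC XML annotations for such boxes; the annotation directory is a placeholder you would point at your own data.

import os
import xml.etree.ElementTree as ET

ANNOTATION_DIR = 'VOCdevkit/VOC2012/Annotations'  # placeholder: your annotation folder

for fname in sorted(os.listdir(ANNOTATION_DIR)):
    if not fname.endswith('.xml'):
        continue
    root = ET.parse(os.path.join(ANNOTATION_DIR, fname)).getroot()
    img_w = int(root.find('size/width').text)
    img_h = int(root.find('size/height').text)
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        xmin = float(box.find('xmin').text)
        ymin = float(box.find('ymin').text)
        xmax = float(box.find('xmax').text)
        ymax = float(box.find('ymax').text)
        # Zero-area or out-of-bounds boxes can drive the RPN location loss to inf/nan.
        if xmax <= xmin or ymax <= ymin or xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
            print('%s: suspicious %s box %s' % (fname, obj.find('name').text,
                                                (xmin, ymin, xmax, ymax)))

If this prints nothing, the next things to suspect are the label map (every object must map to a valid class id after the person/other remapping) and the learning rate.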
Have you solved this problem? I am doing pedestrian detection on the Caltech dataset and the RPN loss does not converge for me either: rpn_classification_loss shows a converging trend, but rpn_location_loss still fluctuates a lot. Is it because the training has not gone on long enough? I changed both max_steps and max_iteration to 100,000 and it still does not work.
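Because the loss is logged per image, individual values such as the rpn_loc_loss spikes above say little about convergence; a running average over a window of steps gives a clearer trend. A small sketch that parses log lines in the format shown above (the log file name is a placeholder):

import re
from collections import deque

WINDOW = 200                      # number of logged steps to average over
window = deque(maxlen=WINDOW)

with open('train.log') as f:      # placeholder: wherever the console log was saved
    for line in f:
        m = re.search(r'rpn_loc_loss:([0-9.]+)', line)
        if m:
            window.append(float(m.group(1)))
            if len(window) == WINDOW:
                print('running mean rpn_loc_loss over last %d steps: %.3f'
                      % (WINDOW, sum(window) / WINDOW))

If the running mean stays flat or keeps rising, more iterations alone are unlikely to help.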
@f27ny105t5123 @zhanhuanli @0uu0 https://github.com/DetectionTeamUCAS/Faster-RCNN_Tensorflow
I ran into the same problem: the class is recognized as 0 and no boxes are drawn. Have you solved it? Could you share how to fix this kind of problem? @yangxue0827
@zhangxiaopang88 You can refer to the cfgs.py of https://github.com/DetectionTeamUCAS/Faster-RCNN_Tensorflow for the parameter settings. I am not sure of the exact cause of your problems; it ran fine when I trained it before.
Thank you, I am already training on my own data, but I have found some missed detections. Online sources say it is due to insufficient samples, so I am adding more now. What I wanted to ask is: with batch_size at 1, is there another parameter that can change the batch size? @yangxue0827
The batch size in this code can only be 1. @zhangxiaopang88
I later looked into it and found that Faster R-CNN's batch_size is hard-coded to 1. Thank you.
How did you set it up? I am also getting no boxes at all. @zhangxiaopang88 @0uu0 @zhanhuanli
I tested a model trained on my own data for 17,000 iterations; every image's class comes out as 0 and there are no predicted boxes. How did you solve this? Which parameter settings need to be changed? @zhangxiaopang88 @f27ny105t5123 @yangxue0827
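One thing worth checking before changing the training parameters is whether the model truly predicts nothing, or whether its scores simply fall below the visualization threshold. A small, hypothetical sketch (the scores/categories arrays stand in for whatever the test script returns per image in this codebase):

import numpy as np

scores = np.array([0.04, 0.12, 0.31])   # placeholder: per-detection confidences for one image
categories = np.array([1, 1, 2])        # placeholder: predicted class ids

threshold = 0.5                          # typical drawing threshold
print('detections above %.2f: %d' % (threshold, int((scores >= threshold).sum())))
print('max score %.3f for class %d' % (scores.max(), categories[scores.argmax()]))

If the maximum score is near zero for every image, training has effectively failed (consistent with the diverging rpn_loc_loss above); if the scores are only slightly below the threshold, lowering the drawing threshold will reveal the boxes.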
Hello, I also ran into the NaN-during-training situation, same as you. How did you eventually solve this problem? Hoping for a reply!! @zhanhuanli @yangxue0827
It is probably a dataset problem; check whether your dataset was processed correctly.
It was not a dataset problem. I checked the dataset and there are no out-of-bounds boxes. But I was previously training with vgg16, and the feature_maps_dict code was written by myself, so the problem was probably there. After switching the model to resnet it now works. @zhanhuanli
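For reference, here is a rough sketch of what a vgg16 feature_maps_dict might look like when built from the TF-slim vgg_16 end_points. This is an assumption-laden sketch, not the code used in this repo: the 'C2'..'C5' keys and the stride-4/8/16/32 pairing mirror what the resnet backbone here provides, and num_classes=None relies on the slim vgg implementation skipping the fully connected head.

import tensorflow as tf
from tensorflow.contrib import slim
from tensorflow.contrib.slim.nets import vgg

def vgg16_feature_maps(img_batch, is_training=True):
    with slim.arg_scope(vgg.vgg_arg_scope()):
        # num_classes=None keeps only the convolutional trunk; we just need end_points.
        _, end_points = vgg.vgg_16(img_batch, num_classes=None,
                                   is_training=is_training, spatial_squeeze=False)
    # Pick feature maps whose strides match the STRIDE config (4, 8, 16, 32);
    # a mismatch between anchor strides and feature-map strides is itself a
    # plausible source of exploding or NaN losses.
    return {
        'C2': end_points['vgg_16/conv3/conv3_3'],  # stride 4
        'C3': end_points['vgg_16/conv4/conv4_3'],  # stride 8
        'C4': end_points['vgg_16/conv5/conv5_3'],  # stride 16
        'C5': end_points['vgg_16/pool5'],          # stride 32
    }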
Recommend improved code: https://github.com/DetectionTeamUCAS/FPN_Tensorflow. @f27ny105t5123 @zhanhuanli @0uu0
@TVXQ20031226 Hello, I would also like to train with vgg16. How did you modify the code? Would you mind sharing?
Hello, the problems I ran into while reproducing the code are: during testing, the test images are not saved into the corresponding folder, and the tfrecord file generated by the code is empty. I hope to get your answer.
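For the empty tfrecord, it may help to first count how many records the conversion script actually wrote; if the count is 0, the problem is in the data-conversion step (annotation paths, the dataset_name filter) rather than in training or testing. A quick check with the TF 1.x record iterator (the tfrecord path is a placeholder):

import tensorflow as tf

tfrecord_path = 'tfrecord/pascal_train.tfrecord'   # placeholder: path to the generated file

count = sum(1 for _ in tf.python_io.tf_record_iterator(tfrecord_path))
print('%s contains %d records' % (tfrecord_path, count))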