faster-rcnn-pytorch

v3.0 keeps raising an error

Open HuKai97 opened this issue 2 years ago • 17 comments

Traceback (most recent call last):
  File "/home/yyds/hukai/faster-rcnn-pytorch-master/train.py", line 382, in <module>
    fit_one_epoch(model, train_util, loss_history, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir)
  File "/home/yyds/hukai/faster-rcnn-pytorch-master/utils/utils_fit.py", line 27, in fit_one_epoch
    rpn_loc, rpn_cls, roi_loc, roi_cls, total = train_util.train_step(images, boxes, labels, 1, fp16, scaler)
  File "/home/yyds/hukai/faster-rcnn-pytorch-master/nets/frcnn_training.py", line 321, in train_step
    losses = self.forward(imgs, bboxes, labels, scale)
  File "/home/yyds/hukai/faster-rcnn-pytorch-master/nets/frcnn_training.py", line 248, in forward
    rpn_locs, rpn_scores, rois, roi_indices, anchor = self.model_train(x = [base_feature, img_size], scale = scale, mode = 'rpn')
  File "/home/yyds/anaconda3/envs/hukai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yyds/anaconda3/envs/hukai/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/yyds/anaconda3/envs/hukai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yyds/hukai/faster-rcnn-pytorch-master/nets/frcnn.py", line 97, in forward
    rpn_locs, rpn_scores, rois, roi_indices, anchor = self.rpn.forward(base_feature, img_size, scale)
  File "/home/yyds/hukai/faster-rcnn-pytorch-master/nets/rpn.py", line 177, in forward
    rois = torch.cat(rois, dim=0).type_as(x)
RuntimeError: Sizes of tensors must match except in dimension 1. Got 117 and 73 (The offending index is 0)

HuKai97 avatar Apr 24 '22 08:04 HuKai97

n_train_post_nms = 600, but some images simply do not have 600 RoIs left after NMS, so concatenating the RoIs fails.

HuKai97 avatar Apr 24 '22 08:04 HuKai97
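The diagnosis above can be illustrated with a small sketch. One common way to give every image the same number of proposals is to re-sample the indices that survived NMS, with replacement, until the target count is reached. This is an assumption about how such a fix could look, not a statement of what the repository does; `pad_keep_indices` is a hypothetical helper, and plain Python lists stand in for tensors so the example runs without PyTorch.

```python
import random


def pad_keep_indices(keep, n_post_nms, rng=None):
    """Return exactly n_post_nms indices (hypothetical helper).

    If NMS kept fewer than n_post_nms proposals, the survivors are
    re-sampled with replacement to fill the gap; if it kept more,
    the list is truncated.
    """
    rng = rng or random.Random()
    keep = list(keep)
    if not keep:
        raise ValueError("NMS kept no proposals; nothing to re-sample from")
    if len(keep) < n_post_nms:
        extra = rng.choices(keep, k=n_post_nms - len(keep))
        keep = keep + extra
    return keep[:n_post_nms]


# Two images in one batch; NMS kept 117 and 73 proposals,
# the two sizes reported in the traceback above.
kept_per_image = [list(range(117)), list(range(73))]
padded = [
    pad_keep_indices(k, n_post_nms=600, rng=random.Random(0))
    for k in kept_per_image
]
lengths = [len(k) for k in padded]
print(lengths)  # [600, 600]
```

With equal lengths per image, the per-image RoI arrays can be stacked into one batch. The duplicated proposals carry no new information; they only keep the shapes consistent.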

Did you change input_shape?

bubbliiiing avatar Apr 26 '22 00:04 bubbliiiing

I changed it to 512x512.

HuKai97 avatar Apr 26 '22 00:04 HuKai97

If you switch back to 600x600, it should work.

bubbliiiing avatar Apr 26 '22 00:04 bubbliiiing

My ablation experiments all use 512x512. Can this code not run at 512x512?

HuKai97 avatar Apr 26 '22 00:04 HuKai97

[image] This is my run at 512…

bubbliiiing avatar Apr 26 '22 00:04 bubbliiiing

I just tried it… In theory it should work. Did you change anything else?

bubbliiiing avatar Apr 26 '22 00:04 bubbliiiing

Epoch 12/25: 100% 378/378 [02:57<00:00, 2.12it/s, lr=6.08e-5, roi_cls=0.174, roi_loc=0.845, rpn_cls=0.0436, rpn_loc=0.106, total_loss=1.17]
Finish Train
Start Validation
Epoch 12/25: 100% 42/42 [00:08<00:00, 4.92it/s, val_loss=1.29]
Finish Validation
Epoch:12/25
Total Loss: 1.168 || Val Loss: 1.293
Start Train
Epoch 13/25: 28% 107/378 [00:50<02:08, 2.10it/s, lr=5.4e-5, roi_cls=0.165, roi_loc=0.809, rpn_cls=0.0376, rpn_loc=0.0958, total_loss=1.11]
Traceback (most recent call last):
  File "train.py", line 382, in <module>
    fit_one_epoch(model, train_util, loss_history, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir)
  File "/content/faster-rcnn-pytorch/utils/utils_fit.py", line 27, in fit_one_epoch
    rpn_loc, rpn_cls, roi_loc, roi_cls, total = train_util.train_step(images, boxes, labels, 1, fp16, scaler)
  File "/content/faster-rcnn-pytorch/nets/frcnn_training.py", line 321, in train_step
    losses = self.forward(imgs, bboxes, labels, scale)
  File "/content/faster-rcnn-pytorch/nets/frcnn_training.py", line 248, in forward
    rpn_locs, rpn_scores, rois, roi_indices, anchor = self.model_train(x = [base_feature, img_size], scale = scale, mode = 'rpn')
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faster-rcnn-pytorch/nets/frcnn.py", line 97, in forward
    rpn_locs, rpn_scores, rois, roi_indices, anchor = self.rpn.forward(base_feature, img_size, scale)
  File "/content/faster-rcnn-pytorch/nets/rpn.py", line 177, in forward
    rois = torch.cat(rois, dim=0).type_as(x)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 593 and 600 in dimension 1 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

oceaneyesz avatar Apr 26 '22 04:04 oceaneyesz

This run used the default parameter settings, and the same problem still occurred.

oceaneyesz avatar Apr 26 '22 04:04 oceaneyesz

H:\pychram\yolov5\Scripts\python.exe K:/pest-main/faster-rcnn-pytorch-master/train.py
Number of devices: 1
Load weights model_data/voc_weights_resnet.pth.
Start Train
Epoch 1/100: 100%|██████████| 1405/1405 [27:57<00:00, 1.19s/it, lr=0.00125, roi_cls=0.136, roi_loc=0.568, rpn_cls=0.0407, rpn_loc=0.719, total_loss=1.46]
Epoch 1/100: 0%| | 0/201 [00:00<?, ?it/s<class 'dict'>]Finish Train
Start Validation
Epoch 1/100: 100%|██████████| 201/201 [01:58<00:00, 1.69it/s, val_loss=1.39]
Finish Validation
Epoch:1/100
Total Loss: 1.463 || Val Loss: 1.388
Start Train
Epoch 2/100: 49%|████▊ | 682/1405 [12:44<13:30, 1.12s/it, lr=0.00125, roi_cls=0.145, roi_loc=0.743, rpn_cls=0.0351, rpn_loc=0.502, total_loss=1.43]
Traceback (most recent call last):
  File "K:/pest-main/faster-rcnn-pytorch-master/train.py", line 382, in <module>
    fit_one_epoch(model, train_util, loss_history, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir)
  File "K:\pest-main\faster-rcnn-pytorch-master\utils\utils_fit.py", line 27, in fit_one_epoch
    rpn_loc, rpn_cls, roi_loc, roi_cls, total = train_util.train_step(images, boxes, labels, 1, fp16, scaler)
  File "K:\pest-main\faster-rcnn-pytorch-master\nets\frcnn_training.py", line 321, in train_step
    losses = self.forward(imgs, bboxes, labels, scale)
  File "K:\pest-main\faster-rcnn-pytorch-master\nets\frcnn_training.py", line 248, in forward
    rpn_locs, rpn_scores, rois, roi_indices, anchor = self.model_train(x = [base_feature, img_size], scale = scale, mode = 'rpn')
  File "D:\python3.8.12\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\python3.8.12\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "D:\python3.8.12\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "K:\pest-main\faster-rcnn-pytorch-master\nets\frcnn.py", line 97, in forward
    rpn_locs, rpn_scores, rois, roi_indices, anchor = self.rpn.forward(base_feature, img_size, scale)
  File "K:\pest-main\faster-rcnn-pytorch-master\nets\rpn.py", line 177, in forward
    rois = torch.cat(rois, dim=0).type_as(x)
RuntimeError: Sizes of tensors must match except in dimension 1. Got 600 and 558 (The offending index is 0)

Mine still hits the same problem.

HuKai97 avatar Apr 26 '22 04:04 HuKai97

All of my parameter settings:

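if __name__ == "__main__":
    #-------------------------------#
    #   Whether to use Cuda
    #   Set to False if there is no GPU
    #-------------------------------#
    Cuda = True
    #---------------------------------------------------------------------#
    #   train_gpu   GPUs used for training
    #               Defaults to the first card; two cards: [0, 1]; three cards: [0, 1, 2]
    #               With multiple GPUs, the batch on each card is the total batch divided by the number of cards.
    #---------------------------------------------------------------------#
    train_gpu = [0, ]
    #---------------------------------------------------------------------#
    #   fp16        Whether to use mixed-precision training
    #               Cuts GPU memory use roughly in half; requires pytorch 1.7.1 or later
    #---------------------------------------------------------------------#
    fp16 = False
    #---------------------------------------------------------------------#
    #   classes_path    Points to the txt file under model_data; depends on your own dataset
    #                   Be sure to change classes_path before training so it matches your dataset
    #---------------------------------------------------------------------#
    classes_path = 'model_data/pest_classes.txt'
    #----------------------------------------------------------------------------------------------------------------------------#
    #   See the README for downloading the weight files (available from a network drive). The model's pretrained weights
    #   are generic across datasets because the features are generic.
    #   The most important part of the pretrained weights is the backbone feature-extraction network, used to extract features.
    #   Pretrained weights are needed in 99% of cases. Without them the backbone weights are too random, feature extraction
    #   is poor, and the training result will not be good either.
    #
    #   If training was interrupted, set model_path to a weight file in the logs folder to reload the partially trained weights.
    #   Also adjust the frozen-phase or unfrozen-phase parameters below to keep the epoch count continuous.
    #
    #   When model_path = '', the weights for the whole model are not loaded.
    #
    #   These are weights for the whole model, so they are loaded in train.py; pretrained below does not affect this loading.
    #   To train from the backbone's pretrained weights, set model_path = '' and pretrained = True; only the backbone is loaded.
    #   To train from scratch, set model_path = '', pretrained = False, Freeze_Train = False; training starts from 0
    #   and the backbone is never frozen.
    #
    #   In general, training from scratch performs poorly because the weights are too random and feature extraction is weak,
    #   so training from scratch is very, very, very strongly discouraged!
    #   If you must start from scratch, look into the imagenet dataset: first train a classification model to obtain the
    #   backbone weights. The backbone of the classification model is shared with this model, so train on top of it.
    #----------------------------------------------------------------------------------------------------------------------------#
    model_path = 'model_data/voc_weights_resnet.pth'
    #------------------------------------------------------#
    #   input_shape     Input shape
    #------------------------------------------------------#
    input_shape = [512, 512]
    #---------------------------------------------#
    #   vgg
    #   resnet50
    #---------------------------------------------#
    backbone = "resnet50"
    #----------------------------------------------------------------------------------------------------------------------------#
    #   pretrained      Whether to use pretrained weights for the backbone. These are backbone weights only,
    #                   so they are loaded when the model is built.
    #                   If model_path is set, the backbone weights need not be loaded and pretrained has no effect.
    #                   If model_path is not set and pretrained = True, only the backbone is loaded before training.
    #                   If model_path is not set, pretrained = False and Freeze_Train = False, training starts from 0
    #                   and the backbone is never frozen.
    #----------------------------------------------------------------------------------------------------------------------------#
    pretrained = True
    #------------------------------------------------------------------------#
    #   anchors_size sets the size of the anchor boxes; every feature point has 9 anchors.
    #   Each number in anchors_size corresponds to 3 anchors.
    #   With anchors_size = [8, 16, 32], the generated anchor widths and heights are roughly:
    #   [90, 180] ; [180, 360]; [360, 720]; [128, 128];
    #   [256, 256]; [512, 512]; [180, 90] ; [360, 180];
    #   [720, 360]; see anchors.py for details
    #   To detect small objects, reduce the leading numbers in anchors_size,
    #   for example anchors_size = [4, 16, 32]
    #------------------------------------------------------------------------#
    anchors_size = [8, 16, 32]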

    #----------------------------------------------------------------------------------------------------------------------------#
    #   Training has two phases: a frozen phase and an unfrozen phase. The frozen phase exists for users with limited hardware.
    #   Frozen training needs less GPU memory. With a very weak GPU, set Freeze_Epoch equal to UnFreeze_Epoch to run frozen training only.
    #
    #   Some suggested parameter settings follow; adjust them flexibly to your own needs:
    #   (1) Training from pretrained weights for the whole model:
    #       Adam:
    #           Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True, optimizer_type = 'adam', Init_lr = 1e-4. (frozen)
    #           Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False, optimizer_type = 'adam', Init_lr = 1e-4. (not frozen)
    #       SGD:
    #           Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True, optimizer_type = 'sgd', Init_lr = 1e-2. (frozen)
    #           Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False, optimizer_type = 'sgd', Init_lr = 1e-2. (not frozen)
    #       Note: UnFreeze_Epoch can be adjusted between 100 and 300.
    #   (2) Training from pretrained weights for the backbone only:
    #       Adam:
    #           Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True, optimizer_type = 'adam', Init_lr = 1e-4. (frozen)
    #           Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False, optimizer_type = 'adam', Init_lr = 1e-4. (not frozen)
    #       SGD:
    #           Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 150, Freeze_Train = True, optimizer_type = 'sgd', Init_lr = 1e-2. (frozen)
    #           Init_Epoch = 0, UnFreeze_Epoch = 150, Freeze_Train = False, optimizer_type = 'sgd', Init_lr = 1e-2. (not frozen)
    #       Note: when starting from backbone-only pretrained weights, the backbone weights are not necessarily suited to
    #             object detection, so more training is needed to escape local optima.
    #             UnFreeze_Epoch can be adjusted between 150 and 300; 300 is recommended for both YOLOV5 and YOLOX.
    #             Adam converges somewhat faster than SGD, so UnFreeze_Epoch can in theory be smaller, but more epochs are still recommended.
    #   (3) Setting batch_size:
    #       Within what the GPU can handle, larger is better. Running out of GPU memory has nothing to do with dataset size;
    #       if you see an out-of-memory message (OOM or CUDA out of memory), reduce batch_size.
    #       The BatchNormalization layers in faster rcnn are already frozen, so batch_size can be 1.
    #----------------------------------------------------------------------------------------------------------------------------#
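    #------------------------------------------------------------------#
    #   Frozen-phase training parameters
    #   The backbone is frozen here, so the feature-extraction network does not change
    #   GPU memory use is low; the network is only fine-tuned
    #   Init_Epoch          Epoch at which training currently starts; it may exceed Freeze_Epoch, e.g.:
    #                       Init_Epoch = 60, Freeze_Epoch = 50, UnFreeze_Epoch = 100
    #                       skips the frozen phase, starts directly from epoch 60, and adjusts the learning rate accordingly.
    #                       (used when resuming from a checkpoint)
    #   Freeze_Epoch        Number of epochs of frozen training
    #                       (ignored when Freeze_Train=False)
    #   Freeze_batch_size   batch_size during frozen training
    #                       (ignored when Freeze_Train=False)
    #------------------------------------------------------------------#
    Init_Epoch          = 0
    Freeze_Epoch        = 50
    Freeze_batch_size   = 16
    #------------------------------------------------------------------#
    #   Unfrozen-phase training parameters
    #   The backbone is no longer frozen, so the feature-extraction network changes
    #   GPU memory use is higher; all network parameters are updated
    #   UnFreeze_Epoch          Total number of epochs the model is trained for
    #   Unfreeze_batch_size     batch_size after unfreezing
    #------------------------------------------------------------------#
    UnFreeze_Epoch      = 100
    Unfreeze_batch_size = 2
    #------------------------------------------------------------------#
    #   Freeze_Train    Whether to run frozen training
    #                   By default the backbone is frozen first, then unfrozen.
    #                   If Freeze_Train=False, the sgd optimizer is recommended
    #------------------------------------------------------------------#
    Freeze_Train        = False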

    #------------------------------------------------------------------#
    #   Other training parameters: learning rate, optimizer, learning-rate decay
    #------------------------------------------------------------------#
    #------------------------------------------------------------------#
    #   Init_lr         Maximum learning rate of the model
    #                   With the Adam optimizer, Init_lr=1e-4 is recommended
    #                   With the SGD optimizer, Init_lr=1e-2 is recommended
    #   Min_lr          Minimum learning rate of the model; defaults to 0.01 times the maximum
    #------------------------------------------------------------------#
    Init_lr             = 1e-2
    Min_lr              = Init_lr * 0.01
    #------------------------------------------------------------------#
    #   optimizer_type  Optimizer to use; options are adam and sgd
    #                   With the Adam optimizer, Init_lr=1e-4 is recommended
    #                   With the SGD optimizer, Init_lr=1e-2 is recommended
    #   momentum        The momentum parameter used inside the optimizer
    #   weight_decay    Weight decay, which helps prevent overfitting
    #                   weight_decay misbehaves with adam; setting it to 0 is recommended when using adam.
    #------------------------------------------------------------------#
    optimizer_type      = "sgd"
    momentum            = 0.9
    weight_decay        = 5e-4
    #------------------------------------------------------------------#
    #   lr_decay_type   Learning-rate decay schedule; options are 'step' and 'cos'
    #------------------------------------------------------------------#
    lr_decay_type       = 'step'
    #------------------------------------------------------------------#
    #   save_period     Save the weights every this many epochs; by default every epoch
    #------------------------------------------------------------------#
    save_period         = 1
    #------------------------------------------------------------------#
    #   save_dir        Folder where weights and log files are saved
    #------------------------------------------------------------------#
    save_dir            = 'logs'
    #------------------------------------------------------------------#
    #   num_workers     Number of worker processes used to read data; 0 loads data in the main process
    #                   Enabling workers speeds up data loading but uses more memory
    #                   Enable them only when IO is the bottleneck, i.e. the GPU computes much faster than images are read.
    #------------------------------------------------------------------#
    num_workers         = 0
    #----------------------------------------------------#
    #   Get the image paths and labels
    #----------------------------------------------------#
    train_annotation_path   = '2007_train.txt'
    val_annotation_path     = '2007_val.txt'

    #----------------------------------------------------#
    #   Get the classes and anchors
    #----------------------------------------------------#
    class_names, num_classes = get_classes(classes_path)

HuKai97 avatar Apr 26 '22 04:04 HuKai97
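The anchor sizes quoted in the `anchors_size` comment of the settings above can be reproduced with a short calculation. The sketch below assumes the conventional Faster R-CNN scheme: a base size of 16 pixels and aspect ratios 0.5, 1 and 2. Neither value appears in the posted settings, so `base_size`, `ratios` and the function `anchor_sizes` itself are assumptions for illustration; the repository's `anchors.py` is the authoritative source.

```python
import math


def anchor_sizes(base_size=16, ratios=(0.5, 1, 2), anchor_scales=(8, 16, 32)):
    """Return (height, width) of each anchor, rounded to whole pixels.

    base_size and ratios are assumed values (not taken from the thread).
    Each scale in anchor_scales yields one anchor per ratio, so three
    scales and three ratios give 9 anchors per feature point.
    """
    sizes = []
    for ratio in ratios:
        for scale in anchor_scales:
            h = base_size * scale * math.sqrt(ratio)
            w = base_size * scale * math.sqrt(1.0 / ratio)
            sizes.append((round(h), round(w)))
    return sizes


sizes = anchor_sizes()
for h, w in sizes:
    print(h, w)
```

Under these assumptions the square anchors come out at exactly 128, 256 and 512 pixels, and the elongated ones at about 91 by 181, 181 by 362 and 362 by 724, which agrees with the approximate values listed in the comment. Changing the first scale from 8 to 4 halves the three smallest anchors, which is why the comment suggests it for small objects.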

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 600 but got size 557 for tensor number 7 in the list.

I have the same problem as you. Did you manage to solve it?

rongliangtang avatar Apr 27 '22 02:04 rongliangtang

Not solved.

HuKai97 avatar Apr 27 '22 04:04 HuKai97

> I just tried it… In theory it should work. Did you change anything else?

Are you not going to fix this problem?

HuKai97 avatar Apr 27 '22 05:04 HuKai97

Let me take a look… I really don't understand why this happens.

bubbliiiing avatar Apr 27 '22 15:04 bubbliiiing

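This is fixed now. Please git clone the repository again. The likely cause is that the objects in your datasets are fairly homogeneous, so the proposal boxes end up highly concentrated, which in turn leaves too few proposals.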

bubbliiiing avatar Apr 27 '22 15:04 bubbliiiing
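The explanation above, that tightly clustered proposals leave too few boxes, can be seen with a toy non-maximum suppression. The sketch is a plain-Python greedy NMS written for illustration, not the repository's code; the 0.7 IoU threshold and the box layouts are assumptions chosen to make the effect visible.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.7):
    """Keep the best-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


n = 50
scores = [1.0 - 0.01 * k for k in range(n)]  # strictly decreasing scores

# Concentrated: 100x100 boxes, each shifted 1 px from the previous one.
concentrated = [(100 + k, 100, 200 + k, 200) for k in range(n)]
# Spread out: 100x100 boxes spaced 150 px apart, so none overlap.
spread = [(150 * k, 0, 150 * k + 100, 100) for k in range(n)]

concentrated_keep = greedy_nms(concentrated, scores)
spread_keep = greedy_nms(spread, scores)
print(len(concentrated_keep), len(spread_keep))  # 3 50
```

Both inputs contain 50 proposals, yet only 3 survive in the concentrated case. A dataset whose objects all look alike and sit in similar places pushes the proposals toward the first situation, so the count after NMS can fall below a fixed target such as 600.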

Alternatively, you can download v3.1 from the releases. Sorry for the trouble this caused.

bubbliiiing avatar Apr 27 '22 15:04 bubbliiiing