
Error during training

Open jly99 opened this issue 1 year ago • 4 comments

Environment: installed as configured in the README
System: Linux
Training command: python train.py configs/vfa/voc/vfa_split1/vfa_r101_c4_8xb4_voc-split1_base-training.py
Error: I am running base training on the VOC dataset. Training starts normally, and at iteration 3000 the checkpoint is saved successfully, but the validation-set inference that follows fails with the error below:

[ ] 0/4952, elapsed: 0s, ETA:Traceback (most recent call last):
  File "train.py", line 252, in <module>
    main()
  File "train.py", line 241, in main
    train_detector(
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmfewshot/detection/apis/train.py", line 197, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 232, in after_train_iter
    self._do_evaluate(runner)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmfewshot/detection/core/evaluation/eval_hooks.py", line 47, in _do_evaluate
    results = single_gpu_test(runner.model, self.dataloader, show=False)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmfewshot/detection/apis/test.py", line 45, in single_gpu_test
    result = model(mode='test', rescale=True, **data)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 42, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmfewshot/detection/models/detectors/query_support_detector.py", line 173, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/hdhcy/opt/vfa-main-new/vfa/vfa_detector.py", line 100, in simple_test
    bbox_results = super().simple_test(img, img_metas, proposals, rescale)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmfewshot/detection/models/detectors/meta_rcnn.py", line 176, in simple_test
    return self.roi_head.simple_test(
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmfewshot/detection/models/roi_heads/meta_rcnn_roi_head.py", line 272, in simple_test
    det_bboxes, det_labels = self.simple_test_bboxes(
  File "/home/hdhcy/opt/vfa-main-new/vfa/vfa_roi_head.py", line 281, in simple_test_bboxes
    det_bbox, det_label = self.bbox_head.get_bboxes(
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 369, in get_bboxes
    det_bboxes, det_labels = multiclass_nms(bboxes, scores,
  File "/home/hdhcy/anaconda3/envs/vfa/lib/python3.8/site-packages/mmdet/core/post_processing/bbox_nms.py", line 38, in multiclass_nms
    bboxes = multi_bboxes.view(multi_scores.size(0), -1, 4)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 4] because the unspecified dimension size -1 can be any value and is ambiguous

Has the author run into this problem? Any help would be appreciated, thanks.

jly99 · May 05 '23 02:05
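
On the final RuntimeError in the traceback above: `view` can infer a `-1` dimension only when the product of the other dimensions is non-zero. Here `multi_scores.size(0)` is 0, so the target shape is `[0, -1, 4]` and every value for `-1` satisfies it. Below is a minimal pure-Python sketch of that inference rule. `infer_view_shape` is a hypothetical helper written for illustration, not PyTorch's implementation.

```python
from math import prod


def infer_view_shape(numel, shape):
    """Resolve a single -1 in `shape` for a tensor holding `numel` elements.

    Illustrative re-implementation of the inference rule only;
    this is NOT PyTorch's actual code.
    """
    shape = list(shape)
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    if -1 not in shape:
        if prod(shape) != numel:
            raise ValueError(f"shape {shape} is invalid for {numel} elements")
        return shape

    # Product of the explicitly given dimensions (1 if there are none).
    known = prod(d for d in shape if d != -1)
    if known == 0:
        if numel == 0:
            # 0 * x == 0 holds for every x, so -1 has no unique solution.
            raise ValueError(
                f"cannot reshape {numel} elements into {shape}: -1 is ambiguous"
            )
        raise ValueError(f"shape {shape} is invalid for {numel} elements")
    if numel % known != 0:
        raise ValueError(f"shape {shape} is invalid for {numel} elements")
    return [numel // known if d == -1 else d for d in shape]


# 1000 boxes with 4 coordinates each: the middle dimension resolves to 1.
print(infer_view_shape(4000, [1000, -1, 4]))

# Zero boxes: the same shape pattern cannot be resolved.
try:
    infer_view_shape(0, [0, -1, 4])
except ValueError as err:
    print(err)
```

In other words, the failing call received no boxes at all for the image. That fits the NaN-loss explanation given later in this thread, though the traceback alone does not confirm it.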

I've hit the same problem. Could anyone with experience offer some pointers?

Burgundy-Red · May 06 '23 12:05

Same question here.

leechenggg · Jun 01 '23 07:06

I ran into the same problem. It is most likely caused by the loss becoming NaN during training; lowering the learning rate fixes it.

lixi92 · Dec 15 '23 07:12
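
One possible reason the default learning rate is too high here, offered as an assumption rather than a confirmed diagnosis: the `8xb4` in the config name conventionally denotes 8 GPUs with 4 images each in OpenMMLab configs, while the traceback goes through `single_gpu_test`, which suggests a non-distributed run. If so, the linear scaling rule gives a starting point for a lower rate. The sketch below applies it. `BASE_LR` is a placeholder value, not the value from the VFA config, and `scale_lr` is a hypothetical helper.

```python
def scale_lr(base_lr, base_batch_size, actual_batch_size):
    """Linear scaling rule: keep lr / batch_size constant."""
    if base_batch_size <= 0 or actual_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * actual_batch_size / base_batch_size


BASE_LR = 0.02          # placeholder; read the real value from the config
BASE_BATCH = 8 * 4      # assumed meaning of "8xb4": 8 GPUs x 4 images
ACTUAL_BATCH = 1 * 4    # assumed single-GPU run with 4 images per GPU

scaled = scale_lr(BASE_LR, BASE_BATCH, ACTUAL_BATCH)
print(f"suggested starting lr: {scaled:g}")
```

The resulting value would go into the `lr` field of the `optimizer` dict in the config, assuming the usual mmcv-style layout. If the loss still becomes NaN, lower it further.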

I ran into the same problem. It is most likely caused by the loss becoming NaN during training; lowering the learning rate fixes it.

qjh666888 · Mar 06 '24 08:03