PaddleYOLO

Training error

Open JackYang825 opened this issue 3 years ago • 10 comments

Search before asking

  • [X] I have searched the issues and found no similar bug report.

Describe the Bug

Training YOLOv7 with yolov7_l_300e_coco.yml fails with the following error:

PaddleDetection/ppdet/modeling/architectures/yolov5.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/work/ptm-online/PaddleDetection/ppdet/modeling/heads/yolo_head.py", line 726, in forward
    return self.loss(yolo_outputs + yolo_outputs_aux, targets,
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/work/ptm-online/PaddleDetection/ppdet/modeling/losses/yolo_loss.py", line 512, in forward
    bs, as_, gjs, gis, targets, anchors = self.build_targets(
  File "/work/ptm-online/PaddleDetection/ppdet/modeling/losses/yolo_loss.py", line 593, in build_targets
    indices, anch = self.find_3_positive(p, targets, anchors)
  File "/work/ptm-online/PaddleDetection/ppdet/modeling/losses/yolo_loss.py", line 783, in find_3_positive
    gxi = gain[[2, 3]] - gxy # inverse
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [2] and the shape of Y = [0, 2, 7]. Received [2] in X is not equal to [7] in Y at i:2.
  [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/phi/kernels/funcs/common_shape.h:84)
  [operator < elementwise_sub > error]
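For readers decoding the error message, here is a minimal sketch of the broadcast check that the hint describes. It is plain Python written only from the condition printed in the hint (`x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1`), with the usual assumption that the shorter shape is right-aligned and padded with leading 1s. `first_broadcast_mismatch` is a hypothetical helper, not a Paddle API, and it says nothing about why `gxy` had shape [0, 2, 7] in the first place.

```python
def first_broadcast_mismatch(x_shape, y_shape):
    """Return the first axis where two shapes cannot broadcast, or None.

    Assumption: shapes are right-aligned and the shorter one is padded
    with leading 1s, as in the usual elementwise broadcasting rule.
    """
    rank = max(len(x_shape), len(y_shape))
    x = (1,) * (rank - len(x_shape)) + tuple(x_shape)
    y = (1,) * (rank - len(y_shape)) + tuple(y_shape)
    for i, (xd, yd) in enumerate(zip(x, y)):
        # Same condition as the hint in the error message.
        if not (xd == yd or xd <= 1 or yd <= 1):
            return i
    return None


# Shapes from the traceback: X = [2], Y = [0, 2, 7].
print(first_broadcast_mismatch((2,), (0, 2, 7)))  # 2

# A shape whose last axis is 2 would pass the same check.
print(first_broadcast_mismatch((2,), (0, 3, 2)))  # None
```

Under these assumptions the check fails on the last axis (2 versus 7), which agrees with the "at i:2" in the message.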

Environment

No response

Are you willing to submit a PR?

  • [ ] Yes I'd like to help by submitting a PR!

JackYang825 avatar Aug 19 '22 08:08 JackYang825

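Hello, please share the training command, the number of GPUs, the total batch size (bs), and your Paddle version so we can investigate. Thanks. #6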

nemonameless avatar Aug 19 '22 08:08 nemonameless

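> Hello, please share the training command, the number of GPUs, the total batch size (bs), and your Paddle version so we can investigate. Thanks. #6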

Command: /usr/bin/python -u /work/ptm-online/PaddleDetection/tools/train.py -c /work/ptm-online/PaddleDetection/configs/yolov7/yolov7_l_300e_coco.yml --use_vdl=True --vdl_log_dir=./output --eval
GPUs: single GPU
bs: 4
Paddle version: paddlepaddle-gpu==2.3.1.post112

JackYang825 avatar Aug 19 '22 08:08 JackYang825

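Fixed. Training with bs=4 on a single GPU is not very meaningful. If you have the resources, increase the total bs to 32 or more and enable AMP mixed-precision training to reduce GPU memory usage. If you don't, train the tiny model instead so that you can keep the total bs at 32 or above.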
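The arithmetic behind this advice can be sketched as follows. It assumes that the total batch size is the number of GPUs multiplied by the per-GPU batch size; `per_gpu_batch_size` is a hypothetical helper for illustration, not part of PaddleDetection or PaddleYOLO.

```python
import math


def per_gpu_batch_size(target_total_bs, num_gpus):
    """Smallest per-GPU batch size whose total reaches target_total_bs.

    Assumption: total bs = num_gpus * per-GPU bs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return math.ceil(target_total_bs / num_gpus)


# The reported setup: 1 GPU with bs=4 gives a total bs of 4.
print(1 * 4)

# Reaching a total bs of 32 on 8 GPUs versus on 1 GPU.
print(per_gpu_batch_size(32, 8))  # 4
print(per_gpu_batch_size(32, 1))  # 32
```

On a single GPU the whole batch of 32 has to fit in memory at once, which is why the comment pairs the larger batch size with AMP or with a smaller model.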

nemonameless avatar Aug 19 '22 15:08 nemonameless

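> Fixed. Training with bs=4 on a single GPU is not very meaningful. If you have the resources, increase the total bs to 32 or more and enable AMP mixed-precision training to reduce GPU memory usage. If you don't, train the tiny model instead so that you can keep the total bs at 32 or above.

May I ask why training with bs=4 is not very meaningful?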

Dandelion111 avatar Aug 20 '22 01:08 Dandelion111

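> Fixed. Training with bs=4 on a single GPU is not very meaningful. If you have the resources, increase the total bs to 32 or more and enable AMP mixed-precision training to reduce GPU memory usage. If you don't, train the tiny model instead so that you can keep the total bs at 32 or above.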

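I have eight GPUs, but with the same code, training on eight GPUs hangs right after the parameters are loaded. Multi-GPU training with the official PaddleDetection worked fine before. I even reduced the batch size to 1, and it still hangs.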

JackYang825 avatar Aug 21 '22 11:08 JackYang825

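> Fixed. Training with bs=4 on a single GPU is not very meaningful. If you have the resources, increase the total bs to 32 or more and enable AMP mixed-precision training to reduce GPU memory usage. If you don't, train the tiny model instead so that you can keep the total bs at 32 or above.
>
> May I ask why training with bs=4 is not very meaningful?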

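Training with too small a bs gives much lower accuracy than training with the normal default bs. If you don't have enough GPU resources, don't train large models.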

nemonameless avatar Aug 21 '22 15:08 nemonameless

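> I have eight GPUs, but with the same code, training on eight GPUs hangs right after the parameters are loaded. Multi-GPU training with the official PaddleDetection worked fine before. I even reduced the batch size to 1, and it still hangs.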

Could you share a screenshot of the hang? Reducing the bs should have nothing to do with the hang.

nemonameless avatar Aug 21 '22 15:08 nemonameless

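> > I have eight GPUs, but with the same code, training on eight GPUs hangs right after the parameters are loaded. Multi-GPU training with the official PaddleDetection worked fine before. I even reduced the batch size to 1, and it still hangs.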

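> Could you share a screenshot of the hang? Reducing the bs should have nothing to do with the hang.

[Screenshot 2022-08-22 09 53 40]
batch size 2, yolov7p6_d6_300e_coco
[Screenshot 2022-08-22 09 55 07]
[Screenshot 2022-08-22 09 58 36]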

JackYang825 avatar Aug 22 '22 01:08 JackYang825

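> Fixed. Training with bs=4 on a single GPU is not very meaningful. If you have the resources, increase the total bs to 32 or more and enable AMP mixed-precision training to reduce GPU memory usage. If you don't, train the tiny model instead so that you can keep the total bs at 32 or above.
>
> May I ask why training with bs=4 is not very meaningful?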

Single-GPU training still has a problem.

[Screenshot 2022-08-22 11 35 36]

JackYang825 avatar Aug 22 '22 03:08 JackYang825

Please try again with the code on the current default branch. For YOLO detection, we recommend training with a bs of at least 32.

nemonameless avatar Oct 14 '22 04:10 nemonameless