ValueError: Target 460 is out of upper bound.

Open monkeycc opened this issue 2 years ago • 5 comments

Search before asking

  • [X] I have searched the question and found no related answer.

Please ask your question

ppyoloe_crn_s_300e_coco, VOC dataset

python tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml


W0823 14:30:26.446256  4452 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.2
W0823 14:30:26.461884  4452 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[08/23 14:30:27] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\fujunnnn/.cache/paddle/weights\CSPResNetb_s_pretrained.pdparams
[08/23 14:30:30] ppdet.engine INFO: Epoch: [0] [  0/339] learning_rate: 0.000000 loss: 1931307253760.000000 loss_cls: 0.594841 loss_iou: 772522901504.000000 loss_dfl: 5885.125977 loss_l1: 0.105123 eta: 4 days, 9:30:32 batch_cost: 3.7348 data_cost: 0.2500 ips: 2.6775 images/s
Traceback (most recent call last):
  File "tools/train.py", line 177, in <module>
    main()
  File "tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\engine\trainer.py", line 454, in train
    outputs = model(data)
  File "E:\anaconda3\envs\PaddleDetection\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "E:\anaconda3\envs\PaddleDetection\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\architectures\yolo.py", line 125, in get_loss
    return self._forward()
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\architectures\yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "E:\anaconda3\envs\PaddleDetection\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "E:\anaconda3\envs\PaddleDetection\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 217, in forward
    return self.forward_train(feats, targets)
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 160, in forward_train
    ], targets)
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 355, in get_loss
    assigned_scores_sum)
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 291, in _bbox_loss
    assigned_ltrb_pos) * bbox_weight
  File "E:\PaddleX_GUI_2.1.0_win10\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 256, in _df_loss
    pred_dist, target_left, reduction='none') * weight_left
  File "E:\anaconda3\envs\PaddleDetection\lib\site-packages\paddle\nn\functional\loss.py", line 1723, in cross_entropy
    label_max.item()))
ValueError: Target 25479 is out of upper bound.

python tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_s_80e_coco.yml


W0823 21:31:38.730271 10200 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.2
W0823 21:31:38.750262 10200 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[08/23 21:31:40] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [4] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[08/23 21:31:40] ppdet.utils.checkpoint INFO: The shape [365, 384, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [4, 384, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[08/23 21:31:40] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [4] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[08/23 21:31:40] ppdet.utils.checkpoint INFO: The shape [365, 192, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [4, 192, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[08/23 21:31:40] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [4] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[08/23 21:31:40] ppdet.utils.checkpoint INFO: The shape [365, 96, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [4, 96, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[08/23 21:31:40] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\MM/.cache/paddle/weights\ppyoloe_crn_s_obj365_pretrained.pdparams
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "D:\0SDXX\PaddleDetection\ppdet\engine\trainer.py", line 504, in train
    outputs = model(data)
  File "D:\Anaconda3\envs\PaddleSeg\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Anaconda3\envs\PaddleSeg\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\architectures\yolo.py", line 124, in get_loss
    return self._forward()
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\architectures\yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "D:\Anaconda3\envs\PaddleSeg\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Anaconda3\envs\PaddleSeg\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 216, in forward
    return self.forward_train(feats, targets)
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 161, in forward_train
    ], targets)
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 354, in get_loss
    assigned_scores_sum)
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 290, in _bbox_loss
    assigned_ltrb_pos) * bbox_weight
  File "D:\0SDXX\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 255, in _df_loss
    pred_dist, target_left, reduction='none') * weight_left
  File "D:\Anaconda3\envs\PaddleSeg\lib\site-packages\paddle\nn\functional\loss.py", line 1723, in cross_entropy
    label_max.item()))
ValueError: Target 28 is out of upper bound.
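
Both tracebacks end in cross_entropy, called from _df_loss, with an integer label that exceeds the number of classes. The sketch below illustrates that kind of range check in plain Python. It is not Paddle's or PaddleDetection's actual code: it assumes a distribution-focal-style loss with reg_max = 16 (so 17 bins), and the helper names split_dfl_target and check_labels are hypothetical.

```python
import math

REG_MAX = 16  # assumption: 17 distance bins; this value is not stated in the thread


def split_dfl_target(distance):
    """Split a continuous distance into left/right integer bins and their weights."""
    left = math.floor(distance)
    right = left + 1
    weight_left = right - distance
    weight_right = 1.0 - weight_left
    return left, right, weight_left, weight_right


def check_labels(labels, num_classes):
    """Raise ValueError if any integer label falls outside [0, num_classes)."""
    for label in labels:
        if label < 0:
            raise ValueError(f"Target {label} is out of lower bound.")
        if label >= num_classes:
            raise ValueError(f"Target {label} is out of upper bound.")


# A well-formed distance stays inside the 17 bins.
left, right, w_left, w_right = split_dfl_target(3.25)
check_labels([left, right], REG_MAX + 1)

# A corrupted distance (for example, garbage produced upstream) does not.
bad_left, _, _, _ = split_dfl_target(28.4)
try:
    check_labels([bad_left], REG_MAX + 1)
except ValueError as err:
    print(err)
```

Under these assumptions, a label such as 28 or 25479 can only appear if the distances fed into the loss are already wrong, which points at the data or at an operator earlier in the loss computation rather than at cross_entropy itself.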

monkeycc avatar Aug 23 '22 05:08 monkeycc

Have you tried other models? Do they show this problem too?

lyuwenyu avatar Aug 24 '22 02:08 lyuwenyu

Have you modified any configuration?

ghostxsl avatar Aug 24 '22 09:08 ghostxsl

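On Windows, please switch to paddle 2.2.2. Newer versions currently have a bug, which we will fix as soon as possible. On Linux the versions work fine.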

nemonameless avatar Aug 24 '22 11:08 nemonameless

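On Windows, please switch to paddle 2.2.2. Newer versions currently have a bug, which we will fix as soon as possible. On Linux the versions work fine.

How do I install the GPU build of 2.2.2? Are there instructions?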

ionescofung avatar Sep 18 '22 02:09 ionescofung

@ionescofung Follow the installation guide at https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/windows-pip.html and set the version in the install command to paddlepaddle-gpu==2.2.2.
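
After installing, it can help to confirm which build actually ended up in the active environment. The following is a minimal sketch using only the standard library. The helper names installed_version and is_pinned are hypothetical, and importlib.metadata requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_pinned(package, expected):
    """True when `package` is installed at `expected`, allowing suffixes such as .postNNN."""
    version = installed_version(package)
    if version is None:
        return False
    return version == expected or version.startswith(expected + ".")


print("paddlepaddle-gpu:", installed_version("paddlepaddle-gpu"))
print("pinned to 2.2.2:", is_pinned("paddlepaddle-gpu", "2.2.2"))
```

If the first line prints None, the package is not installed in the interpreter that ran the script, which usually means a different conda environment is active.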

xiegegege avatar Sep 20 '22 06:09 xiegegege

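I still get the same error with the develop version. I really don't want to roll back to 2.2.2, because it doesn't support CUDA 11.6 and I would have to install 11.2.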

lazyn1997 avatar Nov 20 '22 15:11 lazyn1997

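I still get the same error with the develop version. I really don't want to roll back to 2.2.2, because it doesn't support CUDA 11.6 and I would have to install 11.2.

@lazyn1997 The develop version tests fine on our side.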

xiegegege avatar Nov 23 '22 06:11 xiegegege

Is it related to the network being trained? I'm using ppyoloe.

lazyn1997 avatar Nov 23 '22 12:11 lazyn1997

Are you training on a single GPU?

ghostxsl avatar Nov 23 '22 12:11 ghostxsl

Yes, single-GPU training.

lazyn1997 avatar Nov 23 '22 13:11 lazyn1997

Yes, single-GPU training.

Which PaddleDetection version are you on?

ghostxsl avatar Nov 23 '22 13:11 ghostxsl

Yes, single-GPU training.

Which PaddleDetection version are you on?

release/2.5

lazyn1997 avatar Nov 23 '22 13:11 lazyn1997

Check whether your code contains this line: https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.5/ppdet/modeling/heads/ppyoloe_head.py#L350

ghostxsl avatar Nov 23 '22 13:11 ghostxsl

Yes, that line is there.

lazyn1997 avatar Nov 23 '22 13:11 lazyn1997

Pull the latest code and run it again, and print assigned_scores_sum to debug.
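
When printing an intermediate value such as assigned_scores_sum, it is useful to flag NaN and infinity explicitly rather than scanning the output by eye. A minimal sketch, assuming the value has already been converted to a Python float; the helper names describe and first_bad and the sample values are hypothetical.

```python
import math


def describe(name, value):
    """Format one scalar for a debug print, flagging NaN and infinity."""
    if math.isnan(value):
        status = "NaN"
    elif math.isinf(value):
        status = "inf"
    else:
        status = "ok"
    return f"{name} = {value!r} [{status}]"


def first_bad(values):
    """Return the name of the first non-finite entry in a dict, or None."""
    for name, value in values.items():
        if not math.isfinite(value):
            return name
    return None


# Hypothetical values standing in for what a real training step would produce.
sample = {"assigned_scores_sum": float("nan"), "loss_cls": 0.155425}
for name, value in sample.items():
    print(describe(name, value))
print("first non-finite:", first_bad(sample))
```

Calling first_bad on the intermediate values in the order they are computed narrows down which step first produces a non-finite number.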

ghostxsl avatar Nov 23 '22 13:11 ghostxsl

Or could you share your environment with me? I really can't reproduce this problem locally.

ghostxsl avatar Nov 23 '22 13:11 ghostxsl

Package              Version
astor                0.8.1
attrs                22.1.0
Babel                2.11.0
bce-python-sdk       0.8.74
certifi              2022.9.24
charset-normalizer   2.1.1
click                8.1.3
colorama             0.4.6
cycler               0.11.0
Cython               0.29.32
decorator            5.1.1
dill                 0.3.6
exceptiongroup       1.0.4
filterpy             1.4.5
Flask                2.2.2
Flask-Babel          2.0.0
fonttools            4.25.0
future               0.18.2
idna                 3.4
importlib-metadata   5.0.0
iniconfig            1.1.1
itsdangerous         2.1.2
Jinja2               3.1.2
joblib               1.2.0
kiwisolver           1.4.2
lap                  0.4.0
lxml                 4.9.1
MarkupSafe           2.1.1
matplotlib           3.5.2
mkl-fft              1.3.1
mkl-random           1.2.2
mkl-service          2.4.0
motmetrics           1.2.5
multiprocess         0.70.14
munkres              1.1.4
numpy                1.21.5
opencv-python        4.6.0.66
opt-einsum           3.3.0
packaging            21.3
paddle-bfloat        0.1.7
paddledet            2.5.0
paddlepaddle-gpu     0.0.0.post116
pandas               1.3.5
Pillow               9.3.0
pip                  22.2.2
pluggy               1.0.0
protobuf             3.20.0
pyclipper            1.3.0.post4
pycocotools          2.0.2
pycryptodome         3.15.0
pyparsing            3.0.9
PyQt5                5.15.7
PyQt5-Qt5            5.15.2
PyQt5-sip            12.11.0
pytest               7.2.0
pytest-timeout       2.1.0
python-dateutil      2.8.2
pytz                 2022.6
PyYAML               6.0
requests             2.28.1
scikit-learn         1.0.2
scipy                1.7.3
setuptools           65.5.0
Shapely              1.8.5.post1
sip                  4.19.13
six                  1.16.0
sklearn              0.0
terminaltables       3.1.10
threadpoolctl        3.1.0
tomli                2.0.1
tornado              6.2
tqdm                 4.64.1
typeguard            2.13.3
typing_extensions    4.3.0
urllib3              1.26.12
visualdl             2.4.1
Werkzeug             2.2.2
wheel                0.37.1
wincertstore         0.2
xmltodict            0.13.0
zipp                 3.10.0

(paddle_env) PS E:\Documents\code\GitHub\PaddleDetection> python -u tools/train.py -c .\configs\ppyoloe\ppyoloe_crn_x_300e_LYC_2019_12.yml --eval
W1123 21:33:49.114674 22704 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 11.6
W1123 21:33:49.118680 22704 gpu_resources.cc:91] device: 0, cuDNN Version: 8.5.
[11/23 21:33:50] ppdet.utils.checkpoint INFO: Finish loading model weights: pretrain_weights/CSPResNetb_x_pretrained.pdparams
[11/23 21:33:53] ppdet.engine INFO: Epoch: [0] [ 0/155] learning_rate: 0.000000 loss: -34113.761719 loss_cls: 0.155425 loss_iou: -22994.197266 loss_dfl: 46743.148438 loss_l1: 11.302123 eta: 1 day, 7:21:36 batch_cost: 2.4279 data_cost: 0.2216 ips: 1.6475 images/s

lazyn1997 avatar Nov 23 '22 13:11 lazyn1997

These are my packages and CUDA environment. The GPU is a mobile 3080, and the OS is Windows 11.

lazyn1997 avatar Nov 23 '22 13:11 lazyn1997

When I print it, the error reports nan:
Error: C:\home\workspace\Paddle\paddle\phi\kernels\gpu\bce_loss_kernel.cu:42 Assertion (x >= static_cast<T>(0)) && (x <= one) failed. Input is expected to be within the interval [0, 1], but received nan.
Error: C:\home\workspace\Paddle\paddle\phi\kernels\gpu\bce_loss_kernel.cu:42 Assertion (x >= static_cast<T>(0)) && (x <= one) failed. Input is expected to be within the interval [0, 1], but received nan.
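
The assertion in that message has the form (x >= 0) && (x <= 1). Every ordered comparison involving NaN evaluates to false, so a NaN input fails the check even though it is neither below 0 nor above 1. A small Python sketch of the same condition; the function name in_unit_interval is hypothetical and this is not the kernel's code.

```python
def in_unit_interval(x):
    """Same shape as the asserted condition: 0 <= x and x <= 1."""
    return (x >= 0.0) and (x <= 1.0)


nan = float("nan")
print(in_unit_interval(0.3))  # a valid probability passes
print(in_unit_interval(nan))  # NaN fails both comparisons
print(nan == nan)             # NaN is not even equal to itself
```

So the binary cross-entropy kernel is reporting the symptom here; the NaN was produced earlier in the computation.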

lazyn1997 avatar Nov 24 '22 01:11 lazyn1997

Training runs normally on CPU.

lazyn1997 avatar Nov 24 '22 02:11 lazyn1997

OK, I'll find a Windows machine and try to reproduce it. I suspect the GPU kernel of some operator is faulty on Windows and is producing the NaN.

ghostxsl avatar Nov 24 '22 03:11 ghostxsl

Thanks for your effort.

lazyn1997 avatar Nov 24 '22 03:11 lazyn1997

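First, I reproduced the problem on Windows. It is a bug in the Paddle framework: the paddle.masked_select operator computes incorrect results on GPU. Screenshot attached: (image). The ppyoloe model uses this operator when computing the loss, which causes NaN in the subsequent results.

Second, I can only reproduce the problem in a Python 3.7 environment; in a Python 3.9 environment it works correctly. Screenshot attached: (image).

Finally, I have reported the problem to the Paddle framework team, and a fix will be scheduled. So that you are not blocked, I suggest installing the paddle-develop version in a Python 3.9 environment and running the ppyoloe training there. We sincerely apologize for the inconvenience.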
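
One way to check whether a given machine is affected is to compare paddle.masked_select against a plain-Python reference on a small input. The sketch below is an assumption about how such a check could look, not the test used in this thread. The function names masked_select_reference and compare_with_paddle are hypothetical, and the comparison runs on whatever device Paddle selects by default.

```python
def masked_select_reference(values, mask):
    """Select entries of a 2-D nested list where mask is truthy, in row-major order."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same number of rows")
    selected = []
    for row_values, row_mask in zip(values, mask):
        if len(row_values) != len(row_mask):
            raise ValueError("values and mask rows must have the same length")
        selected.extend(v for v, m in zip(row_values, row_mask) if m)
    return selected


def compare_with_paddle(values, mask):
    """Return True/False if Paddle is importable, or None if it is not installed."""
    try:
        import paddle
    except ImportError:
        return None
    x = paddle.to_tensor(values, dtype="float32")
    m = paddle.to_tensor(mask, dtype="bool")
    got = paddle.masked_select(x, m).numpy().tolist()
    return got == masked_select_reference(values, mask)


values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
mask = [[True, True, True, True], [False, False, False, False]]
print(masked_select_reference(values, mask))
```

Calling compare_with_paddle(values, mask) on the affected Windows GPU setup and again on CPU would show whether the two devices disagree; a False result on GPU only would be consistent with the diagnosis above.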

ghostxsl avatar Nov 24 '22 08:11 ghostxsl

OK, thanks. I am indeed on Python 3.7.

lazyn1997 avatar Nov 24 '22 08:11 lazyn1997