PaddleDetection PicoDet training error: (External) CUDA error(700), an illegal memory access was encountered.

Search before asking

[X] I have searched the issues and found no similar bug report.

Describe the Bug

[05/28 20:59:08] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\vision/.cache/paddle/weights\PPLCNet_x0_75_pretrained.pdparams
Traceback (most recent call last):
  File "tools/train.py", line 177, in <module>
    main()
  File "tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "D:\vision\PaddleDetection\ppdet\engine\trainer.py", line 442, in train
    outputs = model(data)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\vision\PaddleDetection\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\vision\PaddleDetection\ppdet\modeling\architectures\picodet.py", line 79, in get_loss
    loss_gfl = self.head.get_loss(head_outs, self.inputs)
  File "D:\vision\PaddleDetection\ppdet\modeling\heads\pico_head.py", line 723, in get_loss
    avg_factor=4.0)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\vision\PaddleDetection\ppdet\modeling\losses\gfocal_loss.py", line 199, in forward
    loss = self.loss_weight * distribution_focal_loss(pred, target)
  File "D:\vision\PaddleDetection\ppdet\modeling\losses\gfocal_loss.py", line 100, in distribution_focal_loss
    loss = F.cross_entropy(pred, dis_left, reduction='none') * weight_left
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\nn\functional\loss.py", line 1714, in cross_entropy
    if label_min < 0:
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 668, in __bool__
    return self.__nonzero__()
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 665, in __nonzero__
    return bool(np.all(tensor.__array__() > 0))
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
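A note on reading this traceback: CUDA reports errors asynchronously, so the Python line where error 700 surfaces (here the `label_min < 0` check inside `F.cross_entropy`, which forces a device-to-host copy) is not necessarily where the faulty kernel ran. One common way to narrow this down is to set `CUDA_LAUNCH_BLOCKING=1` before the GPU framework initializes, and to validate class-index labels on the host. The sketch below illustrates both in plain Python. `find_out_of_range_labels` is a hypothetical helper, not part of PaddleDetection, and nothing in this report confirms that out-of-range labels are the cause.

```python
import os

# Make CUDA kernel launches synchronous so an error is raised at the call
# that caused it. This must be set before the GPU framework is imported.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes).

    Hypothetical helper: `labels` is any iterable of integer class indices
    that has already been copied to the host, for example with
    `tensor.numpy().ravel().tolist()`.
    """
    return [
        (i, int(v))
        for i, v in enumerate(labels)
        if not 0 <= int(v) < num_classes
    ]


# Example with made-up values: 8 valid classes (indices 0..7).
bad = find_out_of_range_labels([0, 3, 7, 8, -1], num_classes=8)
print(bad)  # [(3, 8), (4, -1)]
```

An empty result would suggest the labels are not the problem and that the fault lies elsewhere, for instance in the installed framework build.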

Environment

PaddlePaddle 2.3.0 + cudnn11.2 (installed via Conda)
PaddleDetection 2.4/release
Windows 10 + RTX 3070 Ti 8G

Following the 30-minute PaddleDetection quick-start guide, yolov3_mobilenet_v1_roadsign trains normally.

After modifying the picodet-s-416 lcnet config to train on the roadsign_voc dataset, the error above occurs.

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

May 28 '22 13:05 liyifan2002

Try reducing the batch size for picodet-s-416 lcnet; the error may be related to insufficient GPU memory.

May 30 '22 03:05 jerrywgz

Reinstalling with paddlepaddle-gpu==2.2.2.post112 solved it.

Jun 18 '22 16:06 lj976264709

Installing version 2.2.2 gives me an error related to the numpy version. How can this be solved?

Aug 22 '22 07:08 LLsmile

Uninstall numpy, then install the latest version.

Sep 13 '22 05:09 hedilong

It is indeed a version issue: paddlepaddle-gpu==2.2.2 works, while the latest 2.3 raises the error.

Sep 20 '22 12:09 monkeycc

Tested 2.3.2 with CUDA 11.6: the same problem occurs. 2.2.2 with CUDA 11.2 works normally.

Oct 28 '22 01:10 light201212
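The reports so far can be summarized as a small version check. The sketch below encodes only what this thread states (2.3.0 and 2.3.2 fail, 2.2.2 works); `parse_major_minor` and `matches_failing_reports` are hypothetical names, and the check says nothing about versions that nobody here tested.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.3.2' or '2.2.2.post112'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def matches_failing_reports(version):
    """True if `version` is in the 2.3 series reported as failing in this thread."""
    return parse_major_minor(version) == (2, 3)


for v in ("2.3.0", "2.3.2", "2.2.2.post112"):
    print(v, matches_failing_reports(v))
```

In practice the argument would be the installed version string, for example `paddle.__version__`.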

Same problem with paddlepaddle-gpu 2.3.2 and CUDA 11.6.

Oct 31 '22 12:10 kaixin-bai

Same problem with paddlepaddle-gpu 2.3.2 and CUDA 11.6.

Nov 05 '22 04:11 983183947

Same problem with paddlepaddle-gpu 2.3.2 and CUDA 11.6.

PS D:\FY\AI\PaddleDetection-release-2.5> python tools/train.py -c ./configs/picodet/picodet_l_640_coco_lcnet.yml --eval
Warning: Unable to use JDE/FairMOT/ByteTrack, please install lap, for example: pip install lap, see https://github.com/gatagat/lap
INFO 2022-11-11 09:33:40,273 utils.py:147] Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
W1111 09:33:40.493805 5080 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W1111 09:33:40.503353 5080 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
[11/11 09:33:41] ppdet.utils.checkpoint INFO: ['last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.0.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.0.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.1.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.1.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.2.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.2.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.3.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.3.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\tao.wan/.cache/paddle/weights\PPLCNet_x2_0_pretrained.pdparams
Traceback (most recent call last):
  File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 173, in <module>
    main()
  File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 169, in main
    run(FLAGS, cfg)
  File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 133, in run
    trainer.train(FLAGS.eval)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\engine\trainer.py", line 506, in train
    outputs = model(data)
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\architectures\picodet.py", line 79, in get_loss
    loss_gfl = self.head.get_loss(head_outs, self.inputs)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\heads\pico_head.py", line 721, in get_loss
    loss_dfl = self.loss_dfl(
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\losses\gfocal_loss.py", line 199, in forward
    loss = self.loss_weight * distribution_focal_loss(pred, target)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\losses\gfocal_loss.py", line 100, in distribution_focal_loss
    loss = F.cross_entropy(pred, dis_left, reduction='none') * weight_left
  File "D:\FY\Anaconda3\lib\site-packages\paddle\nn\functional\loss.py", line 1718, in cross_entropy
    if label_min < 0:
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 669, in __bool__
    return self.__nonzero__()
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 666, in __nonzero__
    return bool(np.all(tensor.__array__() > 0))
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)

Nov 11 '22 01:11 wantao1008hh

pip install paddlepaddle-gpu==2.2.2.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html

Installing this version resolves the problem.

Nov 11 '22 01:11 wantao1008hh
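After switching versions, it may be worth confirming which build Python actually picks up before retrying training. The following is a minimal sketch, assuming PaddlePaddle 2.x where `paddle.version.cuda()`, `paddle.version.cudnn()` and `paddle.utils.run_check()` are available. `paddle_install_info` and `run_paddle_self_check` are hypothetical helper names introduced here for illustration.

```python
def paddle_install_info():
    """Return version details of the installed PaddlePaddle, or None if absent."""
    try:
        import paddle
    except ImportError:
        return None
    return {
        "version": paddle.__version__,
        "cuda": paddle.version.cuda(),
        "cudnn": paddle.version.cudnn(),
        "compiled_with_cuda": paddle.device.is_compiled_with_cuda(),
    }


def run_paddle_self_check():
    """Run PaddlePaddle's built-in installation check, which prints its own report."""
    import paddle

    paddle.utils.run_check()


info = paddle_install_info()
print(info if info is not None else "PaddlePaddle is not installed")
```

If the reported version is still in the 2.3 series, the older wheel was probably installed into a different environment than the one running `tools/train.py`.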