PicoDet training error: (External) CUDA error(700), an illegal memory access was encountered.

Open liyifan2002 opened this issue 3 years ago • 5 comments

Search before asking

  • [X] I have searched the issues and found no similar bug report.

Describe the Bug

[05/28 20:59:08] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\vision/.cache/paddle/weights\PPLCNet_x0_75_pretrained.pdparams
Traceback (most recent call last):
  File "tools/train.py", line 177, in <module>
    main()
  File "tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "D:\vision\PaddleDetection\ppdet\engine\trainer.py", line 442, in train
    outputs = model(data)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\vision\PaddleDetection\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\vision\PaddleDetection\ppdet\modeling\architectures\picodet.py", line 79, in get_loss
    loss_gfl = self.head.get_loss(head_outs, self.inputs)
  File "D:\vision\PaddleDetection\ppdet\modeling\heads\pico_head.py", line 723, in get_loss
    avg_factor=4.0)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\vision\PaddleDetection\ppdet\modeling\losses\gfocal_loss.py", line 199, in forward
    loss = self.loss_weight * distribution_focal_loss(pred, target)
  File "D:\vision\PaddleDetection\ppdet\modeling\losses\gfocal_loss.py", line 100, in distribution_focal_loss
    loss = F.cross_entropy(pred, dis_left, reduction='none') * weight_left
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\nn\functional\loss.py", line 1714, in cross_entropy
    if label_min < 0:
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 668, in __bool__
    return self.__nonzero__()
  File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 665, in __nonzero__
    return bool(np.all(tensor.__array__() > 0))
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
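A note on reading the traceback above: CUDA kernels are launched asynchronously, so the Python frame where error 700 surfaces (here the `label_min < 0` check) is only the point where the host first synchronized with the GPU, not necessarily the operation that made the bad access. A minimal sketch for narrowing this down, assuming the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING` and the `tools/train.py` entry point from the log; the helper names `blocking_env` and `run_training` are hypothetical:

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, each kernel launch blocks until the kernel finishes,
    # so a failing kernel is reported at the call that launched it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(config_path):
    """Run tools/train.py in a child process with blocking launches (hypothetical helper)."""
    cmd = [sys.executable, "tools/train.py", "-c", config_path]
    return subprocess.run(cmd, env=blocking_env()).returncode
```

Training runs slower in this mode, so it is only worth using while chasing the error. Call `run_training(...)` from the PaddleDetection root with the config that fails.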

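Environment

  • PaddlePaddle 2.3.0 + cudnn11.2 (installed via Conda)
  • PaddleDetection 2.4/release
  • Windows10 + RTX 3070 Ti 8G

Following the "30-minute quick start with PaddleDetection" tutorial, yolov3_mobilenet_v1_roadsign trains normally.

After modifying the picodet-s-416 lcnet config to train on the roadsign_voc dataset, training fails with the error above.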

Are you willing to submit a PR?

  • [ ] Yes I'd like to help by submitting a PR!

liyifan2002, May 28 '22 13:05

Try reducing the batch size for picodet-s-416 lcnet; the error may be related to insufficient GPU memory.

jerrywgz, May 30 '22 03:05
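One way to try the smaller batch size suggested above without editing the YAML file is a command-line override. The sketch below only builds the command. It assumes that `tools/train.py` accepts `-o key=value` overrides and that the training batch size lives under `TrainReader.batch_size`, so check both against your PaddleDetection version. The helper name `build_train_command` is hypothetical.

```python
import sys


def build_train_command(config_path, batch_size):
    """Return an argv list for tools/train.py with an overridden batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [
        sys.executable, "tools/train.py",
        "-c", config_path,
        # Assumed override syntax: -o <config key>=<value>
        "-o", f"TrainReader.batch_size={batch_size}",
    ]


if __name__ == "__main__":
    # Config path taken from a later log in this thread; substitute your own.
    cmd = build_train_command("./configs/picodet/picodet_l_640_coco_lcnet.yml", 8)
    print(" ".join(cmd))
```

Note that later replies in this thread attribute the error to the PaddlePaddle version rather than to memory, so a smaller batch size may not be enough on its own.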

Reinstalling with paddlepaddle-gpu==2.2.2.post112 solved it.

lj976264709, Jun 18 '22 16:06

Installing version 2.2.2 gives me an error caused by the numpy version. How can I fix this?

LLsmile, Aug 22 '22 07:08

Uninstall numpy, then install the latest version.

hedilong, Sep 13 '22 05:09

It really is a version problem. Only paddlepaddle-gpu==2.2.2 works; the latest 2.3 raises the error.

monkeycc, Sep 20 '22 12:09

Tested 2.3.2 with cuda11.6 and the same problem occurs; 2.2.2 + cuda11.2 works normally.

light201212, Oct 28 '22 01:10
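To check which side of these reports an environment falls on, the installed wheel version can be read from package metadata. This is a rough sketch based only on what commenters in this thread observed (2.3.0 and 2.3.2 failing, 2.2.2 working). It is not an official compatibility table, and the function names are hypothetical.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.2.2.post112' into (2, 2, 2), dropping non-numeric suffixes."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def reported_as_failing(version_text):
    """True for the 2.3.x series that commenters in this thread reported as failing."""
    return parse_version(version_text)[:2] == (2, 3)


def installed_paddle_version():
    """Return the installed paddlepaddle-gpu version string, or None if absent."""
    try:
        return metadata.version("paddlepaddle-gpu")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_paddle_version()
    if version is None:
        print("paddlepaddle-gpu is not installed in this environment")
    elif reported_as_failing(version):
        print(f"paddlepaddle-gpu {version}: this series was reported as failing in this thread")
    else:
        print(f"paddlepaddle-gpu {version}: not among the versions reported as failing here")
```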

Same problem with paddlepaddle-gpu 2.3.2, cuda 11.6.

kaixin-bai, Oct 31 '22 12:10

Same problem with paddlepaddle-gpu 2.3.2, cuda 11.6.

983183947, Nov 05 '22 04:11

Same problem with paddlepaddle-gpu 2.3.2, cuda 11.6.

PS D:\FY\AI\PaddleDetection-release-2.5> python tools/train.py -c ./configs/picodet/picodet_l_640_coco_lcnet.yml --eval
Warning: Unable to use JDE/FairMOT/ByteTrack, please install lap, for example: pip install lap, see https://github.com/gatagat/lap
INFO 2022-11-11 09:33:40,273 utils.py:147] Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
W1111 09:33:40.493805 5080 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W1111 09:33:40.503353 5080 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
[11/11 09:33:41] ppdet.utils.checkpoint INFO: ['last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.0.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.0.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.1.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.1.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.2.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.2.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.3.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.3.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\tao.wan/.cache/paddle/weights\PPLCNet_x2_0_pretrained.pdparams
Traceback (most recent call last):
  File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 173, in <module>
    main()
  File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 169, in main
    run(FLAGS, cfg)
  File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 133, in run
    trainer.train(FLAGS.eval)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\engine\trainer.py", line 506, in train
    outputs = model(data)
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\architectures\picodet.py", line 79, in get_loss
    loss_gfl = self.head.get_loss(head_outs, self.inputs)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\heads\pico_head.py", line 721, in get_loss
    loss_dfl = self.loss_dfl(
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\losses\gfocal_loss.py", line 199, in forward
    loss = self.loss_weight * distribution_focal_loss(pred, target)
  File "D:\FY\AI\PaddleDetection-release-2.5\ppdet\modeling\losses\gfocal_loss.py", line 100, in distribution_focal_loss
    loss = F.cross_entropy(pred, dis_left, reduction='none') * weight_left
  File "D:\FY\Anaconda3\lib\site-packages\paddle\nn\functional\loss.py", line 1718, in cross_entropy
    if label_min < 0:
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 669, in __bool__
    return self.__nonzero__()
  File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 666, in __nonzero__
    return bool(np.all(tensor.__array__() > 0))
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)

wantao1008hh, Nov 11 '22 01:11

pip install paddlepaddle-gpu==2.2.2.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html

Installing this version fixes the problem.

wantao1008hh, Nov 11 '22 01:11