PicoDet training error: (External) CUDA error(700), an illegal memory access was encountered.
Search before asking
Describe the Bug
[05/28 20:59:08] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\vision/.cache/paddle/weights\PPLCNet_x0_75_pretrained.pdparams
Traceback (most recent call last):
File "tools/train.py", line 177, in
File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\nn\functional\loss.py", line 1714, in cross_entropy
if label_min < 0:
File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 668, in bool
return self.nonzero()
File "D:\vision\anaconda3\envs\paddle\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 665, in nonzero
return bool(np.all(tensor.array() > 0))
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
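The traceback above ends in a host-side check (`if label_min < 0`), which forces a GPU result back to the CPU. CUDA kernels are launched asynchronously, so an illegal access is often reported at the next synchronization point rather than at the kernel that caused it. The following is a minimal sketch for rerunning training with synchronous launches, assuming the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING`; the helper names are hypothetical, and the config path in the comment is the one used later in this thread.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch wait for completion,
    # so a failure is reported near the operation that triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def train_synchronously(config_path):
    """Run tools/train.py (from the PaddleDetection root) with blocking launches."""
    cmd = [sys.executable, "tools/train.py", "-c", config_path]
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode


# Example call, not executed here:
# train_synchronously("./configs/picodet/picodet_l_640_coco_lcnet.yml")
```

Training is slower in this mode, so it is only useful for locating the failing operation.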
Environment
- PaddlePaddle 2.3.0 + cudnn11.2 (installed via Conda)
- PaddleDetection 2.4/release
- Windows10 + RTX 3070 Ti 8G
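Following the "Get started with PaddleDetection in 30 minutes" tutorial, yolov3_mobilenet_v1_roadsign trains normally.
After modifying the picodet-s-416 lcnet config to train on the roadsign_voc dataset, training fails with the error above.
Are you willing to submit a PR?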
- [ ] Yes I'd like to help by submitting a PR!
Try reducing the batch size of picodet-s-416 lcnet; the error may be related to insufficient GPU memory.
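One way to lower the batch size without editing the YAML files is sketched below. It assumes the `-o` override option of `tools/train.py` and that the batch size is stored under the `TrainReader.batch_size` key, as in the PicoDet reader configs. The helper name and the value 8 are illustrative only.

```python
import sys


def train_command(config_path, batch_size):
    """Build a tools/train.py command line that overrides the training batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [
        sys.executable, "tools/train.py",
        "-c", config_path,
        # -o overrides a config entry from the command line (assumed key name).
        "-o", f"TrainReader.batch_size={batch_size}",
    ]


cmd = train_command("./configs/picodet/picodet_l_640_coco_lcnet.yml", 8)
print(" ".join(cmd[1:]))
```

If the same error appears even with a very small batch size, memory pressure is probably not the cause.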
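Reinstalling with paddlepaddle-gpu==2.2.2.post112 solved it.
When I install version 2.2.2 I get an error caused by the numpy version. How can I fix this?
Uninstall numpy, then install the latest version.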
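To see which numpy and Paddle builds are actually installed before and after the reinstall, a short check like the following may help. It is a sketch that uses only the standard library; the package list is an assumption based on the names mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["numpy", "paddlepaddle-gpu"]).items():
    print(f"{name}: {version or 'not installed'}")
```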
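It is indeed a version issue: paddlepaddle-gpu==2.2.2 works, while the latest 2.3 raises the error.
I tested 2.3.2 with CUDA 11.6 and it has the same problem; 2.2.2 with CUDA 11.2 works normally.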
Same problem here with paddlepaddle-gpu 2.3.2 and CUDA 11.6.
PS D:\FY\AI\PaddleDetection-release-2.5> python tools/train.py -c ./configs/picodet/picodet_l_640_coco_lcnet.yml --eval
Warning: Unable to use JDE/FairMOT/ByteTrack, please install lap, for example: pip install lap, see https://github.com/gatagat/lap
INFO 2022-11-11 09:33:40,273 utils.py:147] Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
W1111 09:33:40.493805 5080 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W1111 09:33:40.503353 5080 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
[11/11 09:33:41] ppdet.utils.checkpoint INFO: ['last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.0.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.0.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.1.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.1.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.2.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.2.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [160] in model head.conv_feat.se.3.fc.bias. And the weight fc.bias will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [160, 160, 1, 1] in model head.conv_feat.se.3.fc.weight. And the weight fc.weight will not be loaded
[11/11 09:33:41] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\tao.wan/.cache/paddle/weights\PPLCNet_x2_0_pretrained.pdparams
Traceback (most recent call last):
File "D:\FY\AI\PaddleDetection-release-2.5\tools\train.py", line 173, in
File "D:\FY\Anaconda3\lib\site-packages\paddle\nn\functional\loss.py", line 1718, in cross_entropy
if label_min < 0:
File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 669, in bool
return self.nonzero()
File "D:\FY\Anaconda3\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 666, in nonzero
return bool(np.all(tensor.array() > 0))
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
Installing this version resolves the problem: pip install paddlepaddle-gpu==2.2.2.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
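After installing the pinned wheel, the build that Python actually picks up can be inspected as sketched below. This assumes that `paddle.is_compiled_with_cuda()` and the `paddle.version.cuda()` / `paddle.version.cudnn()` helpers behave as in Paddle 2.x; the function name is hypothetical, and it returns None when Paddle cannot be imported.

```python
def paddle_build_info():
    """Return version details of the installed Paddle build, or None if missing."""
    try:
        import paddle
    except ImportError:
        return None
    return {
        "paddle": paddle.__version__,
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        "cuda": paddle.version.cuda(),
        "cudnn": paddle.version.cudnn(),
    }


info = paddle_build_info()
print(info if info is not None else "paddle is not installed")
# With Paddle installed, paddle.utils.run_check() can additionally be called
# to run a small computation and confirm that the GPU build works.
```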