PaddleDetection: OSError: (External) CUDA error(719), unspecified launch failure when training RetinaNet on the COCO dataset (no parameters modified)

Search before asking

[X] I have searched the issues and found no similar bug report.

Describe the Bug

python tools/train.py -c configs/retinanet/retinanet_r50_fpn_1x_coco.yml --eval -o use_gpu=ture

This produces the following error:

[07/05 21:38:33] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\Hubery/.cache/paddle/weights\ResNet50_cos_pretrained.pdparams
Error: ../paddle/phi/kernels/funcs/gather.cu.h:67 Assertion index_value >= 0 && index_value < input_dims[j] failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [182403] and greater than or equal to 0, but received [0]

(Many more lines with this same error follow.)

Traceback (most recent call last):
  File "tools/train.py", line 177, in <module>
    main()
  File "tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "D:\deep_learning\PaddleDetection\ppdet\engine\trainer.py", line 448, in train
    outputs = model(data)
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\deep_learning\PaddleDetection\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\deep_learning\PaddleDetection\ppdet\modeling\architectures\retinanet.py", line 65, in get_loss
    return self._forward()
  File "D:\deep_learning\PaddleDetection\ppdet\modeling\architectures\retinanet.py", line 57, in _forward
    return self.head(neck_feats, self.inputs)
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\deep_learning\PaddleDetection\ppdet\modeling\heads\retina_head.py", line 105, in forward
    return self.get_loss([cls_logits_list, bboxes_reg_list], targets)
  File "D:\deep_learning\PaddleDetection\ppdet\modeling\heads\retina_head.py", line 160, in get_loss
    cls_tar = gt_class[matches[chosen_mask]]
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 735, in __getitem__
    return _getitem_impl_(self, item)
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\variable_index.py", line 430, in _getitem_impl_
    return get_value_for_bool_tensor(var, slice_item)
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\variable_index.py", line 310, in get_value_for_bool_tensor
    lambda: idx_empty(var))
  File "C:\Users\Hubery\anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\layers\control_flow.py", line 2452, in cond
    pred = pred.numpy()[0]
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases can be found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
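The assertion in the log requires 0 <= index_value < input_dims[j]. As a rough host-side sketch of that same condition in plain Python (the helper name find_out_of_range is hypothetical and not part of Paddle, and the sample values are made up), one could check the indices before they reach the GPU kernel:

```python
def find_out_of_range(indices, dim_size):
    """Return (position, value) pairs for indices outside [0, dim_size)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


# Hypothetical sample values, for illustration only.
sample = [0, 5, 182403, -1]
print(find_out_of_range(sample, 182403))
```

Note that the value reported in the log, 0 against a bound of 182403, satisfies this condition, so a check like this would not flag it.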

How can I resolve this?

Environment

- PaddlePaddle 2.3.0.post112
- paddledet 2.4.0
- Python 3.7
- cudatoolkit 11.2.2
- cudnn 8.2.1.32

GPU model: RTX 3090

nvidia-smi

Tue Jul 5 21:12:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.12 Driver Version: 472.12 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 N/A N/A 13168 C+G ...lPanel\SystemSettings.exe N/A |
| 0 N/A N/A 13228 C+G ...cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 13284 C+G ...bbwe\Microsoft.Photos.exe N/A |
| 0 N/A N/A 13572 C+G ...ge\Application\msedge.exe N/A |
| 0 N/A N/A 18156 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 18600 C+G ...mathpix-snipping-tool.exe N/A |
| 0 N/A N/A 19792 C+G ...y\ShellExperienceHost.exe N/A |
| 0 N/A N/A 20504 C+G ...1\jbr\bin\jcef_helper.exe N/A |
| 0 N/A N/A 20676 C+G ...264.44\msedgewebview2.exe N/A |
| 0 N/A N/A 20984 C+G ...8wekyb3d8bbwe\GameBar.exe N/A |
+-----------------------------------------------------------------------------+

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:15:10_Pacific_Standard_Time_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

In addition, test_architectures.py and the following inference command both pass:

python tools/infer.py -c configs/ppyolo/ppyolo_r50vd_dcn_1x_coco.yml -o use_gpu=true weights=https://paddledet.bj.bcebos.com/models/ppyolo_r50vd_dcn_1x_coco.pdparams --infer_img=demo/000000014439.jpg

.......
Ran 7 tests in 2.265s

OK

Done (t=0.75s)
creating index...
index created!
100%|██████████| 1/1 [00:00<00:00, 1.19it/s]
[07/06 10:08:15] ppdet.engine INFO: Detection bbox results save in output\000000014439.jpg

Training YOLOv3 runs normally:

python tools/train.py -c configs/yolov3/yolov3_darknet53_270e_coco.yml --eval -o use_gpu=ture

[07/06 10:39:58] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\Hubery/.cache/paddle/weights\DarkNet53_pretrained.pdparams
[07/06 10:40:00] ppdet.engine INFO: Epoch: [0] [ 0/14658] learning_rate: 0.000000 loss_xy: 8.961815 loss_wh: 10.346878 loss_obj: 10171.745117 loss_cls: 193.526794 loss: 10384.580078 eta: 89 days, 16:38:52 batch_cost: 1.9581 data_cost: 0.0000 ips: 4.0856 images/s
[07/06 10:40:11] ppdet.engine INFO: Epoch: [0] [ 20/14658] learning_rate: 0.000005 loss_xy: 17.073185 loss_wh: 17.941078 loss_obj: 831.352783 loss_cls: 329.363922 loss: 1140.583252 eta: 27 days, 14:10:25 batch_cost: 0.5345 data_cost: 0.3638 ips: 14.9660 images/s
.................................

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

Jul 06 '22 02:07 Ainult

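I suggest first checking whether the ResNet50_cos_pretrained.pdparams pretrained weights were downloaded completely, or deleting the file, re-downloading it, and training again. Training runs without problems in our own tests.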

Jul 06 '22 03:07 nemonameless
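One way to act on the suggestion above is to compare the cached weights file with a fresh download byte for byte. The following is a minimal sketch using only the Python standard library; the helper names file_fingerprint and same_file are hypothetical, and it assumes a second, known-good copy of the file is available to compare against:

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path.stat().st_size, digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have the same size and the same SHA-256 digest."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

A matching size alone does not prove the contents are identical, which is why the digest is compared as well.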

You haven't adjusted the learning rate.

Jul 06 '22 05:07 ChenjieXu

In reply to "You haven't adjusted the learning rate.":

Hello, is there anything wrong with these settings in optimizer_1x.yml?

epoch: 12

LearningRate:
  base_lr: 0.01
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [8, 11]
  - !LinearWarmup
    start_factor: 0.001
    steps: 500

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0001
    type: L2

Jul 06 '22 07:07 Ainult
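For context on the learning-rate remark, a common convention is to scale base_lr linearly with the total batch size when training on fewer GPUs than the reference setup. The sketch below illustrates that rule together with a linear warmup using the start_factor and steps values from the config above. It is not taken from PaddleDetection's code: the function names are hypothetical, the warmup formula is an assumed linear interpolation, and the reference total batch size of 16 is an assumption that should be checked against the config's documentation.

```python
def scaled_lr(base_lr, ref_total_batch, my_total_batch):
    """Linear scaling rule: learning rate proportional to total batch size."""
    return base_lr * my_total_batch / ref_total_batch


def warmup_lr(step, base_lr, start_factor=0.001, warmup_steps=500):
    """Assumed linear warmup from base_lr * start_factor up to base_lr."""
    if step >= warmup_steps:
        return base_lr
    alpha = step / warmup_steps
    return base_lr * (start_factor * (1 - alpha) + alpha)


# Assumption: reference run used a total batch of 16; this run uses 2.
lr = scaled_lr(0.01, ref_total_batch=16, my_total_batch=2)
print(lr, warmup_lr(0, lr), warmup_lr(500, lr))
```

Under these assumptions the scaled learning rate comes out at roughly 0.00125, an eighth of the configured 0.01.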

I have this problem too when running faster_rcnn_swin_tiny_fpn_3x_coco on Win10, but it runs fine on AI Studio.

Jul 13 '22 09:07 RONINGOD

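In reply to "I have this problem too when running faster_rcnn_swin_tiny_fpn_3x_coco on Win10, but it runs fine on AI Studio.":

On Win10 (which itself has CUDA and cuDNN installed) I set up two virtual environments. Environment A has CUDA and cuDNN installed; environment B does not. RetinaNet cannot run in environment A, but it can run in environment B. I don't know whether this is because environment A conflicts with the CUDA + cuDNN installed on Win10 itself.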

Jul 13 '22 09:07 Ainult
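When comparing two environments like this, it can help to print what each one actually reports. The sketch below collects the Python version and, if Paddle can be imported, the CUDA and cuDNN versions the installed Paddle build was compiled against. The helper name collect_env is hypothetical; running it once in each environment and comparing the output is only a starting point, not a diagnosis.

```python
import platform
import sys


def collect_env():
    """Gather version info; Paddle fields are included only if it imports."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import paddle
    except ImportError:
        info["paddle"] = None
        return info
    info["paddle"] = paddle.__version__
    # Versions the installed Paddle wheel was built with, not the system-wide ones.
    info["paddle_cuda"] = paddle.version.cuda()
    info["paddle_cudnn"] = paddle.version.cudnn()
    info["compiled_with_cuda"] = paddle.is_compiled_with_cuda()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Paddle also provides paddle.utils.run_check() as a quick installation self-test, which can be run in each environment as a further comparison.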

Solved. The cause was that one package in requirements.txt, cython-bbox, had been failing to install. On Win10 its configuration file needs to be modified: https://yuki-ho.blog.csdn.net/article/details/106692395?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1-106692395-blog-123886480.pc_relevant_multi_platform_whitelistv1&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1-106692395-blog-123886480.pc_relevant_multi_platform_whitelistv1&utm_relevant_index=1

Jul 13 '22 11:07 RONINGOD

Has this problem been solved? I'm getting the same error.

Aug 01 '22 01:08 XuLei0

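@XuLei0 I ran into this too. After some trial and error I found that the checkpoint I was using for resume was faulty; switching to a different one fixed it. I'm still looking into why that checkpoint is bad, though, since its file size and everything else match the others.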

Oct 14 '22 06:10 mengdongwei

python -m pip install paddlepaddle-gpu==2.4.0rc0.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html

Switching to a newer version solved this problem. Environment: python=3.9 cuda=11.2 cudnn=8.2

Nov 20 '22 10:11 dongfeicui

On Windows, please install the latest develop version of paddle: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/windows-pip.html

Nov 20 '22 12:11 nemonameless

Hello, has this problem been solved? I'm getting exactly the same error as you. How can I fix it?

Nov 22 '22 03:11 mingheyuemankong

In reply to "Solved. The cause was that one package in requirements.txt, cython-bbox, had been failing to install. On Win10 its configuration file needs to be modified: https://yuki-ho.blog.csdn.net/article/details/106692395?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1-106692395-blog-123886480.pc_relevant_multi_platform_whitelistv1&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1-106692395-blog-123886480.pc_relevant_multi_platform_whitelistv1&utm_relevant_index=1":

This has nothing to do with the problem, does it? It didn't help.

Nov 22 '22 03:11 mingheyuemankong

Hello, how was this problem finally resolved? I've run into the same problem.

Aug 23 '23 03:08 Plusmile
