PaddleDetection

YOLOv5 trains normally on the COCO dataset, but YOLOX training fails with cudaErrorIllegalAddress

Open wjdy opened this issue 2 years ago • 10 comments

Search before asking

  • [X] I have searched the question and found no related answer.

Please ask your question

On the COCO dataset, YOLOv5 trains normally, but training the YOLOX model fails with an error whose cause I cannot locate. I checked GPU memory and it is not overflowing.

loading annotations into memory...
Done (t=14.63s)
creating index...
index created!
[08/17 14:27:03] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 200365, area: 0.0 x1: 296.65, y1: 388.33, x2: 297.67999999999995, y2: 388.33.
[08/17 14:27:14] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 550395, area: 0.0 x1: 9.98, y1: 188.56, x2: 15.52, y2: 188.56.
W0817 14:27:17.776957 512 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 11.2
W0817 14:27:17.800897 512 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
Traceback (most recent call last):
  File "tools/train.py", line 178, in <module>
    main()
  File "tools/train.py", line 174, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 136, in run
    trainer.train(FLAGS.eval)
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\engine\trainer.py", line 487, in train
    outputs = model(data)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\architectures\yolox.py", line 105, in get_loss
    return self._forward()
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\architectures\yolox.py", line 95, in _forward
    yolox_losses = self.head(neck_feats, self.inputs)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\heads\yolo_head.py", line 301, in forward
    ], targets)
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\heads\yolo_head.py", line 321, in get_loss
    pred_score, center_and_strides, pred_bbox, gt_box, gt_label)
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\assigners\simota_assigner.py", line 189, in __call__
    gt_bboxes) # [num_points,num_gts]
  File "E:\AI_Code\PaddleDetection_YOLOSeries-develop\ppdet\modeling\bbox_utils.py", line 227, in batch_bbox_overlaps
    eps = paddle.to_tensor([eps])
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\fluid\framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "D:\Users\Aorus\miniconda3\envs\paddle_env\lib\site-packages\paddle\tensor\creation.py", line 189, in to_tensor
    stop_gradient=stop_gradient)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)

wjdy avatar Aug 17 '22 06:08 wjdy
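
A note on reading the traceback above: CUDA kernels are launched asynchronously, so the Python frame that reports error 700 (here `paddle.to_tensor` inside `batch_bbox_overlaps`) is not necessarily the operation that performed the illegal access. The following is a minimal sketch, assuming the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` is honored in this setup; it reruns training with synchronous launches so the error is more likely to surface near its real source. The helper names and the config path are hypothetical and not taken from the thread.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Standard CUDA runtime variable; makes each kernel launch block until done.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def train_command(config_path):
    """Build the tools/train.py command line for a given config file."""
    return [sys.executable, "tools/train.py", "-c", config_path]


def run_training_synchronously(config_path):
    """Launch training in a child process with CUDA_LAUNCH_BLOCKING=1."""
    # config_path is a placeholder: pass the YOLOX config actually in use.
    return subprocess.run(train_command(config_path), env=blocking_env(), check=False)
```

Training runs noticeably slower in this mode, so it is only meant for locating the failing operation, not for regular use.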

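This may be an environment or version issue. Which Paddle version are you using? I have trained with 2.2.2, 2.3.0, and 2.3.1 without any problems. Questions about PaddleDetection_YOLOSeries can also be asked at https://github.com/nemonameless/PaddleDetection_YOLOSeries/issues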

nemonameless avatar Aug 17 '22 10:08 nemonameless

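My current environment is Paddle 2.3.1 with CUDA 11.2. I previously also tried CUDA 10.2 with Paddle 2.2.2; with that combination training exits on its own (it quits right after training starts and shows no error message at all, which is very strange).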

wjdy avatar Aug 17 '22 11:08 wjdy

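> My current environment is Paddle 2.3.1 with CUDA 11.2. I previously also tried CUDA 10.2 with Paddle 2.2.2; with that combination training exits on its own (it quits right after training starts and shows no error message at all, which is very strange).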

After installing Paddle, please check that the reported version number and GPU count are correct.

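import paddle

# Print the installed version; a bare expression only shows output in an interactive session.
print(paddle.__version__)
# Run Paddle's built-in installation check, which also reports GPU availability.
paddle.utils.run_check()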

nemonameless avatar Aug 21 '22 13:08 nemonameless
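
Because the thread turns on which Paddle and CUDA combinations work, it can help to record the build-time CUDA and cuDNN versions alongside the Paddle version when comparing machines. This is a minimal sketch, assuming `paddle.version.cuda()`, `paddle.version.cudnn()`, `paddle.is_compiled_with_cuda()`, and `paddle.get_device()` are available in the installed release; the function name is hypothetical.

```python
def paddle_environment_report():
    """Collect version facts useful when comparing setups across machines."""
    try:
        import paddle
    except ImportError:
        # Paddle is not installed in this interpreter.
        return {"paddle": None}
    return {
        "paddle": paddle.__version__,
        # Whether this wheel was built with GPU support at all.
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        # CUDA / cuDNN versions the wheel was built against (not the driver version).
        "cuda_build": paddle.version.cuda(),
        "cudnn_build": paddle.version.cudnn(),
        # Device Paddle currently targets, such as a GPU index or the CPU.
        "device": paddle.get_device(),
    }


if __name__ == "__main__":
    for key, value in paddle_environment_report().items():
        print(f"{key}: {value}")
```

Posting this output together with the driver version from the startup log makes reports from different users easier to compare.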

I am using the COCO dataset and get the same error:
OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
I do not know whether the bbox warnings printed before it matter; they also appear in the output posted above:
ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 200365, area: 0.0 x1: 296.65, y1: 388.33, x2: 297.67999999999995, y2: 388.33.
[08/22 08:23:50] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 550395, area: 0.0 x1: 9.98, y1: 188.56, x2: 15.52, y2: 188.56.
Both Paddle's GPU test script and PaddleDetection's test script run without problems.
My environment:
cuda 11.6
paddle 2.3.1, installed with: pip install paddlepaddle-gpu==2.3.1.post116 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
paddledetection release/2.4

LLsmile avatar Aug 22 '22 00:08 LLsmile
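
Whether the invalid-bbox warnings are related to the crash is not established in this thread. To rule them in or out, the zero-area boxes can be listed directly from the annotation file. The sketch below assumes the usual COCO layout, where each annotation stores `bbox` as `[x, y, width, height]`; the function names and the `min_size` threshold are illustrative choices, not PaddleDetection's own check.

```python
import json


def find_degenerate_boxes(annotations, min_size=1e-5):
    """Return annotations whose COCO bbox has (near) zero width or height."""
    bad = []
    for ann in annotations:
        _x, _y, w, h = ann["bbox"]
        if w <= min_size or h <= min_size:
            bad.append(ann)
    return bad


def load_coco_annotations(path):
    """Load the 'annotations' list from a COCO-format JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["annotations"]


if __name__ == "__main__":
    # Boxes rebuilt from the warnings above (x2 - x1 and y2 - y1), plus one valid box.
    sample = [
        {"image_id": 200365, "bbox": [296.65, 388.33, 1.03, 0.0]},
        {"image_id": 550395, "bbox": [9.98, 188.56, 5.54, 0.0]},
        {"image_id": 1, "bbox": [10.0, 20.0, 30.0, 40.0]},
    ]
    print([a["image_id"] for a in find_degenerate_boxes(sample)])
```

Running the same function on a copy of the annotation file with these entries removed is one way to test whether they play any role in the GPU error.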

Switching to the develop branch still shows this problem, but training on the CPU does not raise the error.

LLsmile avatar Aug 22 '22 01:08 LLsmile
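
The observation that CPU training works gives a way to separate GPU-specific failures from data or model problems. The following is a minimal sketch for standalone scripts, assuming `paddle.set_device` and `paddle.get_device` behave as documented. It does not change what `tools/train.py` does, since PaddleDetection picks the device from its own config (the `use_gpu` key, as far as the versions discussed here are concerned). The function name is hypothetical.

```python
def select_device(force_cpu=False):
    """Pick the Paddle device, optionally forcing CPU for a comparison run."""
    try:
        import paddle
    except ImportError:
        # Paddle is not installed in this interpreter.
        return None
    if force_cpu or not paddle.is_compiled_with_cuda():
        paddle.set_device("cpu")
    else:
        paddle.set_device("gpu")
    # Report the device that subsequent tensors and ops will use.
    return paddle.get_device()


if __name__ == "__main__":
    print(select_device(force_cpu=True))
```

If the same inputs pass on the CPU and fail on the GPU, the problem more likely lies in the GPU kernel, driver, or CUDA build than in the dataset.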

Paddle 2.3.2 also trains and runs without problems for me; I am using CUDA 10.1.

nemonameless avatar Aug 23 '22 02:08 nemonameless

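> Switching to the develop branch still shows this problem, but training on the CPU does not raise the error.

Hello, have you found a solution? My error is the same as yours.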

qinlihaoWork avatar Aug 25 '22 01:08 qinlihaoWork

You can also switch to Paddle 2.2.2 for training.

nemonameless avatar Aug 25 '22 02:08 nemonameless

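With the same environment configuration, a project created on AI Studio trains normally. Training on my own machine fails with this error; I spent several days trying to fix it without success and ended up using BML CodeLab directly.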

wjdy avatar Aug 25 '22 03:08 wjdy

It is indeed a version issue: only paddlepaddle-gpu==2.2.2 works; the latest 2.3 raises the error.

monkeycc avatar Sep 20 '22 12:09 monkeycc