
OSError: (External) CUDA error(700), an illegal memory access was encountered.

Open xxPete opened this issue 2 years ago • 3 comments

Checklist:

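  1. Search past related issues for an answer
  2. Read through the FAQ and the Q&A summary
  3. Confirm whether the bug is still unfixed in the latest version
  4. Read through the PaddleX documentation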

Problem description

Reproduction

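  1. Have you successfully run the tutorial we provide?
  • Yes, it runs normally
  2. Did you modify the code from the tutorial? If so, please provide the code you ran
  • No
  3. Which dataset are you using?
  • The xiaoduxiong (Xiaodu bear) instance segmentation dataset
  4. Please provide the error message and the related log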
2022-10-09 09:05:30,360-WARNING: type object 'QuantizationTransformPass' has no attribute '_supported_quantizable_op_type'
2022-10-09 09:05:30,360-WARNING: If you want to use training-aware and post-training quantization, please use Paddle >= 1.8.4 or develop version
D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
2022-10-09 09:05:31 [INFO]      14 samples in file ./dataset/xiaoduxiong_ins_det/train.json, including 14 positive samples and 0 negative samples.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
2022-10-09 09:05:31 [INFO]      4 samples in file ./dataset/xiaoduxiong_ins_det/val.json, including 4 positive samples and 0 negative samples.
W1009 09:05:31.109730 19380 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W1009 09:05:31.112730 19380 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
2022-10-09 09:05:31 [INFO]      Loading pretrained model from output/mask_rcnn_r50_fpn\pretrain\mask_rcnn_r50_fpn_2x_coco.pdparams
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.weight doesn't match.(Pretrained: [1024, 81], Actual: [1024, 2])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.bias doesn't match.(Pretrained: [81], Actual: [2])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.weight doesn't match.(Pretrained: [1024, 320], Actual: [1024, 4])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.bias doesn't match.(Pretrained: [320], Actual: [4])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.weight doesn't match.(Pretrained: [80, 256, 1, 1], Actual: [1, 256, 1, 1])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.bias doesn't match.(Pretrained: [80], Actual: [1])
2022-10-09 09:05:32 [INFO]      There are 301/307 variables loaded into MaskRCNN.
Traceback (most recent call last):
  File ".\train_xiaodu.py", line 40, in <module>
    use_vdl=False)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 2188, in train
    early_stop_patience, use_vdl, resume_checkpoint)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 334, in train
    use_vdl=use_vdl)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\base.py", line 339, in train_loop
    outputs = self.run(self.net, data, mode='train')
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 105, in run
    net_out = net(inputs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 123, in get_loss
    bbox_loss, mask_loss, rpn_loss = self._forward()
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 93, in _forward
    rois, rois_num, rpn_loss = self.rpn_head(body_feats, self.inputs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 140, in forward
    loss = self.get_loss(scores, deltas, anchors, inputs)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 278, in get_loss
    pos_ind = paddle.nonzero(pos_mask)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\tensor\search.py", line 402, in nonzero
    outs = _C_ops.where_index(x)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:251)
  [operator < where_index > error]

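Environment

  1. Please provide the versions of PaddlePaddle and PaddleX you are using
  • paddlepaddle-gpu 2.3.2.post116
  • paddlex 2.1.0
  2. Please provide your operating system, e.g. Linux/Windows/MacOS
  • Windows
  3. Which Python version are you using?
  • 3.7
  4. Which CUDA/cuDNN versions are you using?
  • 11.6/8.6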

xxPete, Oct 09 '22 01:10

  • Additional information that appeared after debugging:
Error: ../paddle/phi/kernels/funcs/scatter.cu.h:66 Assertion `scatter_i >= 0` failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be greater than or equal to 0, but received [-1118890112]
There are several hundred messages like this one.

xxPete, Oct 09 '22 01:10
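The assertion in the debug output above reports a scatter index far outside the valid range. The thread does not say how that output was obtained. As a rough sketch, one common way to get such errors reported at the call that launched the failing kernel is the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING`. The helper `find_out_of_range` is hypothetical, not part of Paddle or PaddleX, and only illustrates the bounds check the assertion performs.

```python
import os

# Standard CUDA runtime setting: kernels run synchronously, so an error is
# reported at the launching call. Set it before importing paddle.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range(indices, size):
    """Hypothetical helper: positions whose index lies outside [0, size)."""
    return [pos for pos, idx in enumerate(indices) if not 0 <= idx < size]


# The negative value is the one quoted in the assertion message.
bad_positions = find_out_of_range([0, 3, -1118890112, 7], size=8)
print(bad_positions)  # [2]
```

A value such as -1118890112 is unlikely to be a deliberately computed index, which is consistent with the illegal memory access reported in the traceback.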

Switching to paddlepaddle-gpu 2.1.3.post112 resolves the problem.

SUNbrightness, Dec 27 '22 09:12
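To compare an installed build against the one reported as working above, a small sketch like the following may help. Both function names are hypothetical and use only the standard library. The only fact taken from the thread is that 2.1.3.post112 worked for this commenter while 2.3.2.post116 failed for the original poster.

```python
def parse_paddle_version(version):
    """Split a version such as '2.3.2.post116' into ((2, 3, 2), 'post116')."""
    parts = version.split(".")
    numeric = tuple(int(p) for p in parts if p.isdigit())
    suffix = next((p for p in parts if not p.isdigit()), "")
    return numeric, suffix


def matches_reported_fix(version):
    """True if the release number equals 2.1.3, the one reported as working."""
    return parse_paddle_version(version)[0] == (2, 1, 3)


print(matches_reported_fix("2.3.2.post116"))  # False
print(matches_reported_fix("2.1.3.post112"))  # True
```

The version string can be taken from `pip show paddlepaddle-gpu`. After reinstalling, `paddle.utils.run_check()` gives a quick check that the GPU build works at all.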

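I ran into the same error on a segmentation task with the DeepLabV3P model. After setting use_mixed_loss = False the error disappeared, so it seems DeepLabV3P cannot use the mixed loss function. My environment: Win10, paddle-gpu 2.3.2 post112, paddlex 2.1.0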

keepgoing365, May 17 '23 09:05
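The workaround in the comment above could look roughly like this. It is a sketch, assuming the PaddleX 2.x segmentation API where `DeepLabV3P` accepts `num_classes` and `use_mixed_loss`. The helper `build_deeplab_kwargs` and the class count of 2 are illustrative assumptions, not values taken from the thread.

```python
def build_deeplab_kwargs(num_classes):
    """Hypothetical helper: constructor arguments with the mixed loss off."""
    return {"num_classes": num_classes, "use_mixed_loss": False}


kwargs = build_deeplab_kwargs(num_classes=2)  # class count is an assumption

try:
    import paddlex as pdx
except ImportError:
    pdx = None  # PaddleX is not installed; only the arguments are built

if pdx is not None:
    # Assumed PaddleX 2.x API: the model falls back to its default loss.
    model = pdx.seg.DeepLabV3P(**kwargs)
```

Whether this also applies to the Mask R-CNN failure in the original report is not established in the thread; the commenter saw it only with DeepLabV3P.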