PaddleX OSError: (External) CUDA error(700), an illegal memory access was encountered.

Checklist:

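Searched existing related issues for an answer
Read through the FAQ and Q&A summary
Confirmed that the bug is still unfixed in the latest version
Read through the PaddleX documentation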

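Problem description

Reproduced from the project at https://aistudio.baidu.com/aistudio/projectdetail/4398052?channelType=0&channel=0. Training runs normally on AI Studio but fails on my local machine, even though there is enough GPU memory.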

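Reproduction

Have you successfully run the tutorial we provide?

Yes, it runs normally.

Did you modify the code on top of the tutorial? If so, please provide the code you ran.

No.

Which dataset are you using?

The xiaoduxiong (Xiaodu bear) instance segmentation dataset.

Please provide the error message and related logs.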

2022-10-09 09:05:30,360-WARNING: type object 'QuantizationTransformPass' has no attribute '_supported_quantizable_op_type'
2022-10-09 09:05:30,360-WARNING: If you want to use training-aware and post-training quantization, please use Paddle >= 1.8.4 or develop version
D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
2022-10-09 09:05:31 [INFO]      14 samples in file ./dataset/xiaoduxiong_ins_det/train.json, including 14 positive samples and 0 negative samples.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
2022-10-09 09:05:31 [INFO]      4 samples in file ./dataset/xiaoduxiong_ins_det/val.json, including 4 positive samples and 0 negative samples.
W1009 09:05:31.109730 19380 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W1009 09:05:31.112730 19380 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
2022-10-09 09:05:31 [INFO]      Loading pretrained model from output/mask_rcnn_r50_fpn\pretrain\mask_rcnn_r50_fpn_2x_coco.pdparams
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.weight doesn't match.(Pretrained: [1024, 81], Actual: [1024, 2])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.bias doesn't match.(Pretrained: [81], Actual: [2])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.weight doesn't match.(Pretrained: [1024, 320], Actual: [1024, 4])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.bias doesn't match.(Pretrained: [320], Actual: [4])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.weight doesn't match.(Pretrained: [80, 256, 1, 1], Actual: [1, 256, 1, 1])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.bias doesn't match.(Pretrained: [80], Actual: [1])
2022-10-09 09:05:32 [INFO]      There are 301/307 variables loaded into MaskRCNN.
Traceback (most recent call last):
  File ".\train_xiaodu.py", line 40, in <module>
    use_vdl=False)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 2188, in train
    early_stop_patience, use_vdl, resume_checkpoint)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 334, in train
    use_vdl=use_vdl)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\base.py", line 339, in train_loop
    outputs = self.run(self.net, data, mode='train')
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 105, in run
    net_out = net(inputs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 123, in get_loss
    bbox_loss, mask_loss, rpn_loss = self._forward()
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 93, in _forward
    rois, rois_num, rpn_loss = self.rpn_head(body_feats, self.inputs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 140, in forward
    loss = self.get_loss(scores, deltas, anchors, inputs)
  File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 278, in get_loss
    pos_ind = paddle.nonzero(pos_mask)
  File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\tensor\search.py", line 402, in nonzero
    outs = _C_ops.where_index(x)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:251)
  [operator < where_index > error]

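Environment

Please provide the versions of PaddlePaddle and PaddleX you are using.

paddlepaddle-gpu 2.3.2.post116
paddlex 2.1.0

Please provide your operating system, e.g. Linux/Windows/MacOS.

Windows

Which Python version are you using?

3.7

Which CUDA/cuDNN versions are you using?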

11.6/8.6

Oct 09 '22 01:10 xxPete

Adding the messages that appeared after debugging:

Error: ../paddle/phi/kernels/funcs/scatter.cu.h:66 Assertion `scatter_i >= 0` failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be greater than or equal to 0, but received [-1118890112]. Several hundred lines all report this same message.
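
The traceback above ends in where_index, while this assertion points at a scatter kernel. CUDA reports kernel failures asynchronously, so the Python frame shown in a traceback is not always the operation that actually faulted. A minimal sketch, assuming the standard CUDA environment variable CUDA_LAUNCH_BLOCKING is honored by the local Paddle build: setting it before CUDA is initialized makes kernel launches synchronous, so the error should surface closer to the failing op. The helper name enable_blocking_launches is hypothetical and is not part of Paddle or PaddleX.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given mapping (default: os.environ).

    Hypothetical helper. Call it before the first `import paddle`, so the
    variable is already set when CUDA is initialized.
    """
    target = os.environ if env is None else env
    target["CUDA_LAUNCH_BLOCKING"] = "1"
    return target


if __name__ == "__main__":
    enable_blocking_launches()
    # Import paddle / paddlex only after this point, then start training.
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Training will likely run slower with blocking launches, so this is only meant for locating the failing kernel.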

Oct 09 '22 01:10 xxPete

Switching to paddlepaddle-gpu 2.1.3.post112 resolves the problem.
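
To compare a local install against the builds named in this thread, the version strings can be split into a release and a CUDA tag. A minimal sketch: parse_paddle_version and PaddleBuild are hypothetical names, and reading the postNNN suffix as the CUDA toolkit tag (116 for CUDA 11.6, 112 for CUDA 11.2) is an assumption based on the versions reported here.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PaddleBuild(NamedTuple):
    release: Tuple[int, int, int]  # e.g. (2, 3, 2)
    cuda_tag: Optional[str]        # e.g. "116", or None if there is no .post suffix


def parse_paddle_version(version: str) -> PaddleBuild:
    """Split a string such as '2.3.2.post116' into release and CUDA tag."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.group(1, 2, 3))
    return PaddleBuild(release, match.group(4))


if __name__ == "__main__":
    failing = parse_paddle_version("2.3.2.post116")  # build from the issue
    working = parse_paddle_version("2.1.3.post112")  # build reported to work
    print("failing:", failing)
    print("working:", working)
    print("reported-working build is older:", working.release < failing.release)
```

On Python 3.8 or later, importlib.metadata.version("paddlepaddle-gpu") should return the installed version string that can be passed to this helper.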

Dec 27 '22 09:12 SUNbrightness

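I hit the same error on a segmentation task with the DeepLabV3P model. After setting use_mixed_loss = False the error went away, so it seems DeepLabV3P cannot be used with the mixed loss. My environment: Win10, paddle-gpu 2.3.2 post112
paddlex 2.1.0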
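
The workaround in this comment amounts to passing use_mixed_loss=False when the segmenter is constructed. A minimal sketch, assuming PaddleX 2.1.0 exposes DeepLabV3P under paddlex.seg with num_classes and use_mixed_loss keyword arguments. build_deeplabv3p is a hypothetical wrapper, and num_classes=2 in the usage note is a placeholder.

```python
def build_deeplabv3p(pdx, num_classes):
    """Construct DeepLabV3P with the mixed loss turned off.

    `pdx` is the imported paddlex module, passed in so this sketch does
    not import PaddleX at definition time. Hypothetical wrapper; the
    keyword names are assumed to match the installed PaddleX release.
    """
    return pdx.seg.DeepLabV3P(
        num_classes=num_classes,
        use_mixed_loss=False,  # the setting the commenter changed
    )
```

Usage would look like `import paddlex as pdx` followed by `model = build_deeplabv3p(pdx, num_classes=2)`, with the rest of the training script left unchanged.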

May 17 '23 09:05 keepgoing365
