
Plenty of GPU memory is still free, but it fails with OSError: (External) CUDA error(700), an illegal memory access was encountered.

Open andeluleidisi opened this issue 3 years ago • 3 comments

Checklist:

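  1. Search existing related issues for an answer
  2. Read through the FAQ and the Q&A summary
  3. Confirm whether the bug is still unfixed in the latest version
  4. Read through the PaddleX documentation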

Problem description

I took the object detection example program from PaddleX and changed the data processing, mainly the image size. I also reduced the training batch size to 4, but the following error occurred: OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)

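Reproduction

  1. Have you been able to run the tutorial we provide successfully? No
  2. Have you modified the code on top of the tutorial? If so, please provide the code you ran. Yes, modified: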
import paddlex as pdx
from paddlex import transforms as T

# Download and decompress the insect detection dataset
dataset = 'https://bj.bcebos.com/paddlex/datasets/insect_det.tar.gz'
pdx.utils.download_and_decompress(dataset, path='./')

# Define the transforms used for training and evaluation
# API reference: https://github.com/PaddlePaddle/PaddleX/blob/develop/docs/apis/transforms/transforms.md
train_transforms = T.Compose([T.Resize([320,320],interp='CUBIC'),T.RandomVerticalFlip(),T.RandomCrop(),T.Normalize()])

eval_transforms = T.Compose([T.Resize([320,320],interp='CUBIC'),T.RandomHorizontalFlip(),T.CenterCrop(),T.Normalize()])


# Define the datasets used for training and evaluation
# API reference: https://github.com/PaddlePaddle/PaddleX/blob/develop/docs/apis/datasets.md
train_dataset = pdx.datasets.VOCDetection(
    data_dir='insect_det',
    file_list='insect_det/train_list.txt',
    label_list='insect_det/labels.txt',
    transforms=train_transforms,
    shuffle=True)

eval_dataset = pdx.datasets.VOCDetection(
    data_dir='insect_det',
    file_list='insect_det/val_list.txt',
    label_list='insect_det/labels.txt',
    transforms=eval_transforms,
    shuffle=False)

# Initialize the model and train it
# VisualDL can be used to view training metrics; see https://github.com/PaddlePaddle/PaddleX/blob/develop/docs/visualdl.md
num_classes = len(train_dataset.labels)
model = pdx.det.PicoDet(num_classes=num_classes)

# API reference: https://github.com/PaddlePaddle/PaddleX/blob/develop//docs/apis/models/detection.md
# Parameter descriptions and tuning notes: https://github.com/PaddlePaddle/PaddleX/blob/develop//docs/parameters.md
model.train(
    num_epochs=270,
    train_dataset=train_dataset,
    train_batch_size=4,
    eval_dataset=eval_dataset,
    learning_rate=0.001 / 8,
    warmup_steps=1000,
    warmup_start_lr=0.0,
    save_interval_epochs=5,
    lr_decay_epochs=[216, 243],
    save_dir='output/yolov3_darknet53')
  1. Which dataset are you using? The one the example program downloads itself
  2. Please provide the error message and relevant logs: OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)

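Environment

  1. Please provide the versions of PaddlePaddle and PaddleX you are using: paddlepaddle-gpu 2.3.1.post116, paddle2.1.0
  2. Please provide your operating system, e.g. Linux/Windows/MacOS: Windows
  3. Which Python version are you using? Python 3.7
  4. Which CUDA/cuDNN versions are you using? CUDA 11.6, cuDNN 8.4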
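
When comparing environments in a report like this, it can help to read the CUDA build straight off the wheel version string. The sketch below assumes the usual `postNNN` naming of `paddlepaddle-gpu` wheels, where the last digit of the suffix is the CUDA minor version; `cuda_from_wheel_version` is a hypothetical helper written for illustration, not part of Paddle.

```python
import re


def cuda_from_wheel_version(version):
    """Return the CUDA version encoded in a wheel tag such as '2.3.1.post116'.

    Assumption: the trailing 'postNNN' suffix encodes CUDA as <major><minor>,
    with a single-digit minor version. Returns None when there is no suffix.
    """
    match = re.search(r"\.post(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_wheel_version("2.3.1.post116"))  # 11.6
print(cuda_from_wheel_version("2.3.2.post112"))  # 11.2
```

The value obtained this way can then be compared against the "Runtime API Version" line that Paddle prints at startup.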

andeluleidisi, Jul 11 '22 04:07

I later found that the error only occurs when using the PicoDet model.

andeluleidisi, Jul 11 '22 11:07

Try downgrading CUDA to 11.2.

lailuboy, Aug 04 '22 03:08

python OK.py

2022-08-22 17:16:52,476-WARNING: type object 'QuantizationTransformPass' has no attribute '_supported_quantizable_op_type'
2022-08-22 17:16:52,476-WARNING: If you want to use training-aware and post-training quantization, please use Paddle >= 1.8.4 or develop version
D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
2022-08-22 17:16:53 [INFO]      Starting to read file list from dataset...
2022-08-22 17:16:57 [INFO]      3399 samples in file D:\paddlex_workspace08\datasets\D0039\train_list.txt, including 3399 positive samples and 0 negative samples.
creating index...
index created!
2022-08-22 17:16:57 [INFO]      Starting to read file list from dataset...
2022-08-22 17:16:58 [INFO]      970 samples in file D:\paddlex_workspace08\datasets\D0039\val_list.txt, including 970 positive samples and 0 negative samples.
creating index...
index created!
2022-08-22 17:17:02 [INFO]      5555 negative samples added. Dataset contains 3399 positive samples and 5555 negative samples.
W0822 17:17:02.240533  9516 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.2
W0822 17:17:02.256193  9516 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
2022-08-22 17:17:02 [INFO]      Loading pretrained model from D:\paddlex_workspace08\projects\P0037\T0085\output\pretrain\picodet_lcnet_1_5x_416_coco.pdparams
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls0.weight doesn't match.(Pretrained: [112, 128, 1, 1], Actual: [36, 128, 1, 1])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls0.bias doesn't match.(Pretrained: [112], Actual: [36])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls1.weight doesn't match.(Pretrained: [112, 128, 1, 1], Actual: [36, 128, 1, 1])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls1.bias doesn't match.(Pretrained: [112], Actual: [36])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls2.weight doesn't match.(Pretrained: [112, 128, 1, 1], Actual: [36, 128, 1, 1])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls2.bias doesn't match.(Pretrained: [112], Actual: [36])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls3.weight doesn't match.(Pretrained: [112, 128, 1, 1], Actual: [36, 128, 1, 1])
2022-08-22 17:17:02 [WARNING]   [SKIP] Shape of pretrained params head.head_cls3.bias doesn't match.(Pretrained: [112], Actual: [36])
2022-08-22 17:17:02 [INFO]      There are 483/491 variables loaded into PicoDet.
Traceback (most recent call last):
  File "tranOK.py", line 95, in <module>
    resume_checkpoint=None)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\cv\models\detector.py", line 910, in train
    resume_checkpoint=resume_checkpoint)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\cv\models\detector.py", line 334, in train
    use_vdl=use_vdl)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\cv\models\base.py", line 337, in train_loop
    outputs = self.run(self.net, data, mode='train')
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\cv\models\detector.py", line 105, in run
    net_out = net(inputs)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppdet\modeling\architectures\picodet.py", line 79, in get_loss
    loss_gfl = self.head.get_loss(head_outs, self.inputs)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppdet\modeling\heads\simota_head.py", line 384, in get_loss
    gt_box, gt_label)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppdet\modeling\heads\simota_head.py", line 113, in _get_target_single
    flatten_bbox, gt_bboxes, gt_labels)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppdet\modeling\assigners\simota_assigner.py", line 189, in __call__
    gt_bboxes)  # [num_points,num_gts]
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddlex\ppdet\modeling\bbox_utils.py", line 227, in batch_bbox_overlaps
    eps = paddle.to_tensor([eps])
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddle\fluid\framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "D:\anaconda308\envs\paddleX2022\lib\site-packages\paddle\tensor\creation.py", line 189, in to_tensor
    stop_gradient=stop_gradient)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:258)
paddlepaddle-gpu   2.3.2.post112
paddlex            2.1.0
cuda 11.2
cuDNN Version: 8.2
python 3.7
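
A note on reading the traceback above: it ends in `paddle.to_tensor`, which only creates a small tensor, so the faulting kernel was probably launched earlier. CUDA reports kernel errors asynchronously, and the generic way to localize them is to set the `CUDA_LAUNCH_BLOCKING` environment variable so that every launch is synchronous. This is a minimal sketch of that general debugging step, not a fix for this issue, and it assumes the variable is set before Paddle initializes CUDA.

```python
import os

# Force synchronous kernel launches so the Python traceback points at the
# operator that actually faulted. This must be set before the CUDA context is
# created, so it goes ahead of any paddle / paddlex import.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# After this point, import paddlex and run the unchanged training script.
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Training runs slower in this mode, so it is only worth enabling while reproducing the error.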

I get the same problem when training the PicoDet model.

Training other models works fine; the problem only appears when training PicoDet. @lailuboy
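
Since only PicoDet fails and the traceback passes through `simota_assigner.py` and `batch_bbox_overlaps`, one thing that may be worth ruling out is malformed ground-truth boxes reaching the assigner. This is a guess, not a confirmed cause. The sketch below is plain Python and assumes boxes are given as `[xmin, ymin, xmax, ymax]` in pixels; `find_bad_boxes` is a hypothetical helper, not a PaddleX function.

```python
def find_bad_boxes(boxes, width, height):
    """Return the indices of boxes that are degenerate or outside the image.

    Assumption: each box is [xmin, ymin, xmax, ymax] in pixel coordinates.
    """
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        degenerate = xmax <= xmin or ymax <= ymin
        outside = xmin < 0 or ymin < 0 or xmax > width or ymax > height
        if degenerate or outside:
            bad.append(i)
    return bad


sample = [[10, 10, 50, 60], [30, 40, 30, 80], [300, 300, 330, 310]]
print(find_bad_boxes(sample, width=320, height=320))  # [1, 2]
```

How to pull the boxes out of a transformed sample depends on the PaddleX version and is not shown here.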

Did you ever get this solved? @andeluleidisi

monkeycc, Aug 22 '22 09:08

Same problem with PicoDet.

light201212, Oct 27 '22 06:10