mmrotate [Bug] "CUDA error: an illegal memory access was encountered" when I evaluate the result of Rotated RepPoints

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] I have read the FAQ documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version (master) or latest version (1.x).

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmrotate

Environment

sys.platform: linux
Python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
CUDA available: True
GPU 0,1: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.24
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.3
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2
OpenCV: 4.7.0
MMCV: 1.5.3
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.1
MMRotate: 0.3.3+04da23d

Reproduces the problem - code sample

mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor(scale_factor)

This line is in _get_bboxes_single in mmrotate/models/dense_heads/rotated_reppoints_head.py.

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0,1 bash tools/dist_test.sh checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x/rotated_reppoints_r50_fpn_1x.py checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x/epoch_1.pth 2 --work-dir checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x --eval mAP

Reproduces the problem - error message

  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/tools/test.py", line 267, in <module>
    main()
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/tools/test.py", line 238, in main
    outputs = multi_gpu_test(model, data_loader, args.tmpdir,
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmdet/apis/test.py", line 109, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/mmrotate/models/detectors/single_stage.py", line 101, in simple_test
    bbox_list = self.bbox_head.get_bboxes(
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
    return old_func(*args, **kwargs)
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1066, in get_bboxes
    results = self._get_bboxes_single(cls_score_list, point_pred_list,
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1159, in _get_bboxes_single
    mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor(
RuntimeError: CUDA error: an illegal memory access was encountered

Additional information

The dataset I used is not an official dataset; I converted it to the DOTA v1 annotation format.
I have used it to train several networks, such as oriented-rcnn and oriented-faster-rcnn, and never ran into this problem.
Here is the Rotated RepPoints config, which I wrote myself:

dataset_type = 'DOTADataset'
CLASSES = ('airplane', 'helicopter', 'small-vehicle', 'large-vehicle', 'ship',
           'container', 'storage-tank', 'swimming-pool', 'windmill')
work_dir = 'checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x'
angle_version = 'le90'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1000, 1000)),
    dict(type='RRandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1000, 1000),
        flip=False,
        transforms=[
            dict(type='RResize'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    train=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/train/dota/',
        img_prefix='dataset/sodaa800/train/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RResize', img_scale=(1000, 1000)),
            dict(type='RRandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/val/dota/',
        img_prefix='dataset/sodaa800/val/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 1000),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/test/dota/',
        img_prefix='dataset/sodaa800/test/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 1000),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)
model = dict(
    type='RotatedRepPoints',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        zero_init_residual=False,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=0,
        add_extra_convs='on_input',
        num_outs=5,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)),
    bbox_head=dict(
        type='RotatedRepPointsHead',
        num_classes=9,
        in_channels=256,
        feat_channels=256,
        point_feat_channels=256,
        stacked_convs=3,
        num_points=9,
        gradient_mul=0.3,
        point_strides=[4, 8, 16, 32, 64],
        point_base_scale=2,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox_init=dict(type='ConvexGIoULoss', loss_weight=0.375),
        loss_bbox_refine=dict(type='ConvexGIoULoss', loss_weight=1.0),
        transform_method='rotrect',
        use_reassign=False,
        topk=6,
        anti_factor=0.75),
    train_cfg=dict(
        init=dict(
            assigner=dict(type='ConvexAssigner', scale=4, pos_num=1),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        refine=dict(
            assigner=dict(
                type='MaxConvexIoUAssigner',
                pos_iou_thr=0.4,
                neg_iou_thr=0.3,
                min_pos_iou=0,
                ignore_iof_thr=-1),
            allowed_border=-1,
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        nms_pre=2000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'demo/oriented_reppoints_r50_fpn_1x_dota_le135-ef072de9.pth'
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_resume = False
gpu_ids = range(0, 2)

Mar 30 '23 13:03 ShiZican

#614

Apr 03 '23 03:04 pphgood

Did you solve this problem? I am hitting the same error.

Jul 25 '23 06:07 Meize0729

I found that this bug results from some bad labels. When an image is split into smaller tiles, some bounding boxes are cut apart, which leaves extremely narrow bounding boxes near the new image border. During forward propagation, these boxes cause the network to generate prediction tensors with particularly large dimensions, and the training process is then terminated for exceeding the computational capacity. Once the cause is known, the solution is easy to find. Here are two options:

1. Find the image and label that were being processed when the training process was terminated, then delete them.
2. Find bad labels like the one below and delete them.

[Screenshot 2024-03-12 161210]
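The second option can be automated. The following is only a rough sketch, not part of MMRotate: it assumes DOTA-style label files in which each object line reads "x1 y1 x2 y2 x3 y3 x4 y4 category difficulty", and it flags any box whose shortest edge falls below a threshold. The function names and the 2-pixel threshold are hypothetical; the thread does not say how narrow a box must be to trigger the error, so the threshold needs tuning per dataset.

```python
import math
from pathlib import Path

# Hypothetical threshold in pixels; the thread gives no exact value.
MIN_SIDE = 2.0


def polygon_sides(coords):
    """Return the four edge lengths of a polygon given as x1, y1, ..., x4, y4."""
    pts = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return [math.dist(pts[i], pts[(i + 1) % 4]) for i in range(4)]


def is_bad_label(line, min_side=MIN_SIDE):
    """True if the annotation line describes a box with an edge shorter than min_side."""
    parts = line.split()
    if len(parts) < 9:
        # Not an object line (e.g. 'imagesource:...' or 'gsd:...' headers).
        return False
    try:
        coords = [float(v) for v in parts[:8]]
    except ValueError:
        return False
    return min(polygon_sides(coords)) < min_side


def find_bad_labels(ann_dir, min_side=MIN_SIDE):
    """Scan every .txt file in ann_dir and list (file name, line number, text) of narrow boxes."""
    bad = []
    for path in sorted(Path(ann_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if is_bad_label(line, min_side):
                bad.append((path.name, lineno, line.strip()))
    return bad
```

Calling find_bad_labels('dataset/sodaa800/train/dota/') with the annotation path from the config above would return the suspicious entries, which can then be reviewed by hand before anything is deleted.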

Mar 12 '24 09:03 walkerinrain
