
[Bug] "CUDA error: an illegal memory access was encountered" when evaluating the results of Rotated RepPoints

Open ShiZican opened this issue 1 year ago • 3 comments

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmrotate

Environment

sys.platform: linux
Python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
CUDA available: True
GPU 0,1: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.24
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2
OpenCV: 4.7.0
MMCV: 1.5.3
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.1
MMRotate: 0.3.3+04da23d

Reproduces the problem - code sample

mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor( scale_factor)

This line is in the _get_bboxes_single method in mmrotate/models/dense_heads/rotated_reppoints_head.py.

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0,1 bash tools/dist_test.sh checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x/rotated_reppoints_r50_fpn_1x.py checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x/epoch_1.pth 2 --work-dir checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x --eval mAP

Reproduces the problem - error message

  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/tools/test.py", line 267, in <module>
    main()
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/tools/test.py", line 238, in main
    outputs = multi_gpu_test(model, data_loader, args.tmpdir,
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmdet/apis/test.py", line 109, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/mmrotate/models/detectors/single_stage.py", line 101, in simple_test
    bbox_list = self.bbox_head.get_bboxes(
  File "/home/omnisky/anaconda3/envs/szcMMrotate/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
    return old_func(*args, **kwargs)
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1066, in get_bboxes
    results = self._get_bboxes_single(cls_score_list, point_pred_list,
  File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1159, in _get_bboxes_single
    mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
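
A note on reading this traceback: CUDA kernels are launched asynchronously, so an illegal memory access can be reported at a later line than the kernel that actually faulted. The rescale line shown above may therefore only be where the error surfaced. A minimal sketch of how one might rerun with synchronous launches, assuming the standard CUDA_LAUNCH_BLOCKING environment variable; the helper name blocking_env and the placeholder config and checkpoint names are hypothetical:

```python
import os


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    blocking_env is a hypothetical helper name used only for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until it finishes,
    # so the Python traceback points at the operation that really failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Single GPU keeps the output easier to read while debugging.
env = blocking_env({"CUDA_VISIBLE_DEVICES": "0"})

# Same test script as in this report; <config> and <checkpoint> are placeholders.
cmd = ["bash", "tools/dist_test.sh", "<config>.py", "<checkpoint>.pth", "1",
       "--eval", "mAP"]

# To actually run it: subprocess.run(cmd, env={**os.environ, **env}, check=True)
print(env["CUDA_LAUNCH_BLOCKING"], " ".join(cmd))
```

Synchronous launches slow the run down, so this is only useful for locating the failing operation, not for regular evaluation.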

Additional information

  1. The dataset I used is not an official dataset; I converted it to the DOTA v1 annotation format.
  2. I have used it to train several other networks, such as oriented-rcnn and oriented-faster-rcnn, and never ran into this problem.
  3. Here is the rotated-reppoint config, which I set up myself:

dataset_type = 'DOTADataset'
CLASSES = ('airplane', 'helicopter', 'small-vehicle', 'large-vehicle', 'ship',
           'container', 'storage-tank', 'swimming-pool', 'windmill')
work_dir = 'checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x'
angle_version = 'le90'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1000, 1000)),
    dict(type='RRandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1000, 1000),
        flip=False,
        transforms=[
            dict(type='RResize'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    train=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/train/dota/',
        img_prefix='dataset/sodaa800/train/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RResize', img_scale=(1000, 1000)),
            dict(type='RRandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/val/dota/',
        img_prefix='dataset/sodaa800/val/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 1000),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/test/dota/',
        img_prefix='dataset/sodaa800/test/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 1000),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)
model = dict(
    type='RotatedRepPoints',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        zero_init_residual=False,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=0,
        add_extra_convs='on_input',
        num_outs=5,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)),
    bbox_head=dict(
        type='RotatedRepPointsHead',
        num_classes=9,
        in_channels=256,
        feat_channels=256,
        point_feat_channels=256,
        stacked_convs=3,
        num_points=9,
        gradient_mul=0.3,
        point_strides=[4, 8, 16, 32, 64],
        point_base_scale=2,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox_init=dict(type='ConvexGIoULoss', loss_weight=0.375),
        loss_bbox_refine=dict(type='ConvexGIoULoss', loss_weight=1.0),
        transform_method='rotrect',
        use_reassign=False,
        topk=6,
        anti_factor=0.75),
    train_cfg=dict(
        init=dict(
            assigner=dict(type='ConvexAssigner', scale=4, pos_num=1),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        refine=dict(
            assigner=dict(
                type='MaxConvexIoUAssigner',
                pos_iou_thr=0.4,
                neg_iou_thr=0.3,
                min_pos_iou=0,
                ignore_iof_thr=-1),
            allowed_border=-1,
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        nms_pre=2000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'demo/oriented_reppoints_r50_fpn_1x_dota_le135-ef072de9.pth'
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_resume = False
gpu_ids = range(0, 2)

ShiZican • Mar 30 '23 13:03

#614

pphgood • Apr 03 '23 03:04

Did you solve this problem? I'm hitting the same error.

Meize0729 • Jul 25 '23 06:07

I found that this bug results from bad labels. When the images are split into smaller patches, some bounding boxes are cut apart, which produces extremely narrow bounding boxes near the new image borders. After forward propagation through the network, these bounding boxes cause the network to generate prediction tensors with particularly large dimensions, and the training process is then terminated because it exceeds the available computational capacity. Once the cause is known, the solution is easy to find. Here are two options:

  1. Find the image and label that were being processed when the training process was terminated, then delete them.
  2. Find bad labels like the one shown in the screenshot and delete them.

[Screenshot 2024-03-12 161210]
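
The second option can be roughly automated. The following is a minimal sketch, not the commenter's actual script: it assumes DOTA-style annotation files where each object line holds eight polygon coordinates followed by the category and a difficulty flag, and it flags polygons whose shortest side falls below a threshold. The 2-pixel threshold and all function names are hypothetical choices for illustration.

```python
import math
from pathlib import Path

MIN_SIDE = 2.0  # pixels; hypothetical threshold, tune it for your dataset


def polygon_sides(coords):
    """Return the four side lengths of a quadrilateral given as x1 y1 ... x4 y4."""
    pts = list(zip(coords[0::2], coords[1::2]))
    return [math.dist(pts[i], pts[(i + 1) % 4]) for i in range(4)]


def is_degenerate(coords, min_side=MIN_SIDE):
    """True if the shortest side of the box is below min_side."""
    return min(polygon_sides(coords)) < min_side


def parse_dota_line(line):
    """Parse one annotation line; return (coords, category) or None.

    Lines with fewer than nine fields or non-numeric coordinates (for example
    'imagesource:...' or 'gsd:...' headers) are skipped.
    """
    parts = line.split()
    if len(parts) < 9:
        return None
    try:
        coords = [float(v) for v in parts[:8]]
    except ValueError:
        return None
    return coords, parts[8]


def find_bad_labels(ann_dir, min_side=MIN_SIDE):
    """Scan every .txt file in ann_dir and list (file, line number, text) hits."""
    bad = []
    for path in sorted(Path(ann_dir).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            parsed = parse_dota_line(line)
            if parsed is not None and is_degenerate(parsed[0], min_side):
                bad.append((path.name, lineno, line.strip()))
    return bad
```

Calling find_bad_labels('dataset/sodaa800/train/dota/') (the ann_file directory from the config in this issue) would return the suspicious lines for manual review. Reviewing the hits before deleting anything is safer than removing them blindly, since small objects can legitimately have short sides.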

walkerinrain • Mar 12 '24 09:03