mmrotate
[Bug] "CUDA error: an illegal memory access was encountered" when I evaluate the results of Rotated RepPoints
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (master) or latest version (1.x).
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
master branch https://github.com/open-mmlab/mmrotate
Environment
sys.platform: linux
Python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
CUDA available: True
GPU 0,1: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.24
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.8.2
OpenCV: 4.7.0
MMCV: 1.5.3
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.1
MMRotate: 0.3.3+04da23d
Reproduces the problem - code sample
mlvl_bboxes[..., :4] /= mlvl_bboxes[..., :4].new_tensor(scale_factor)
This line is in the _get_bboxes_single method in mmrotate/models/dense_heads/rotated_reppoints_head.py.
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES=0,1 bash tools/dist_test.sh checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x/rotated_reppoints_r50_fpn_1x.py checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x/epoch_1.pth 2 --work-dir checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x --eval mAP
Reproduces the problem - error message
File "/media/omnisky/8TDisk/SZC/SODA-mmrotate/tools/test.py", line 267, in
Additional information
- The dataset I used is not an official dataset; I converted it to the DOTAv1 annotation format.
- I have used it to train several networks, such as Oriented R-CNN and Oriented Faster R-CNN, and never had this problem.
- Here is the Rotated RepPoints config, which I wrote myself:

dataset_type = 'DOTADataset'
CLASSES = ('airplane', 'helicopter', 'small-vehicle', 'large-vehicle', 'ship',
           'container', 'storage-tank', 'swimming-pool', 'windmill')
work_dir = 'checkpoint/DotaSodaa800/rotated_reppoints_r50_fpn_1x'
angle_version = 'le90'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1000, 1000)),
    dict(type='RRandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1000, 1000),
        flip=False,
        transforms=[
            dict(type='RResize'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=2,
    train=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/train/dota/',
        img_prefix='dataset/sodaa800/train/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RResize', img_scale=(1000, 1000)),
            dict(type='RRandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/val/dota/',
        img_prefix='dataset/sodaa800/val/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 1000),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='DOTADataset',
        version='le90',
        classes=('airplane', 'helicopter', 'small-vehicle', 'large-vehicle',
                 'ship', 'container', 'storage-tank', 'swimming-pool',
                 'windmill'),
        ann_file='dataset/sodaa800/test/dota/',
        img_prefix='dataset/sodaa800/test/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 1000),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)
model = dict(
    type='RotatedRepPoints',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        zero_init_residual=False,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=0,
        add_extra_convs='on_input',
        num_outs=5,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)),
    bbox_head=dict(
        type='RotatedRepPointsHead',
        num_classes=9,
        in_channels=256,
        feat_channels=256,
        point_feat_channels=256,
        stacked_convs=3,
        num_points=9,
        gradient_mul=0.3,
        point_strides=[4, 8, 16, 32, 64],
        point_base_scale=2,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox_init=dict(type='ConvexGIoULoss', loss_weight=0.375),
        loss_bbox_refine=dict(type='ConvexGIoULoss', loss_weight=1.0),
        transform_method='rotrect',
        use_reassign=False,
        topk=6,
        anti_factor=0.75),
    train_cfg=dict(
        init=dict(
            assigner=dict(type='ConvexAssigner', scale=4, pos_num=1),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        refine=dict(
            assigner=dict(
                type='MaxConvexIoUAssigner',
                pos_iou_thr=0.4,
                neg_iou_thr=0.3,
                min_pos_iou=0,
                ignore_iof_thr=-1),
            allowed_border=-1,
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        nms_pre=2000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'demo/oriented_reppoints_r50_fpn_1x_dota_le135-ef072de9.pth'
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_resume = False
gpu_ids = range(0, 2)
#614
Did you solve this problem? I'm hitting the same error.
I found that this bug results from some bad labels.
After the images are split into smaller patches, some bounding boxes get cut apart, which produces extremely narrow bounding boxes near the new image borders. After forward propagation through the network, these bounding boxes cause the network to generate prediction tensors with particularly large dimensions, and the training process is then terminated because it exceeds the computational capacity.
Once we know the cause, the solution is easy to find. Here are two options:
1. Find the image and label that were being processed when the training process was terminated, then delete them.
2. Find bad labels like the ones below and delete them.
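For option 2, here is a rough sketch of how such labels might be located. This is not code from mmrotate: the function names (`parse_dota_line`, `find_bad_labels`) are made up for illustration, and the `min_side` and `min_area` thresholds are assumptions that you would need to tune for your own data. It assumes DOTAv1-style annotation files, where each object line is `x1 y1 x2 y2 x3 y3 x4 y4 category difficulty`, and it only reports suspicious lines without deleting anything.

```python
import math
from pathlib import Path


def parse_dota_line(line):
    """Return (points, category) for an object line, or None for header/blank lines."""
    parts = line.split()
    if len(parts) < 9:
        # Header lines such as "imagesource:..." or "gsd:..." have too few fields.
        return None
    try:
        coords = [float(v) for v in parts[:8]]
    except ValueError:
        return None
    points = list(zip(coords[0::2], coords[1::2]))
    return points, parts[8]


def min_side_and_area(points):
    """Shortest edge length and shoelace area of a 4-point polygon."""
    sides = [math.dist(points[i], points[(i + 1) % 4]) for i in range(4)]
    twice_area = sum(
        points[i][0] * points[(i + 1) % 4][1]
        - points[(i + 1) % 4][0] * points[i][1]
        for i in range(4)
    )
    return min(sides), abs(twice_area) / 2.0


def find_bad_labels(ann_dir, min_side=2.0, min_area=4.0):
    """List (file, line number, category, shortest side, area) for suspicious boxes.

    min_side and min_area are illustrative thresholds in pixels, not values
    taken from mmrotate.
    """
    bad = []
    for path in sorted(Path(ann_dir).glob("*.txt")):
        lines = path.read_text().splitlines()
        for lineno, line in enumerate(lines, start=1):
            parsed = parse_dota_line(line)
            if parsed is None:
                continue
            points, category = parsed
            side, area = min_side_and_area(points)
            if side < min_side or area < min_area:
                bad.append((path.name, lineno, category, side, area))
    return bad
```

For the config in this issue, you could call `find_bad_labels('dataset/sodaa800/train/dota/')` (and likewise for the val and test folders), look over the reported lines, and then remove the ones that really are degenerate.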