
CUDA error: an illegal memory access was encountered

[Open] shnew opened this issue on Jul 11, 2022 · 10 comments

First of all, thank you very much for your work. However, testing rotated_reppoints always fails with the following error:

    File "/data2/S/RepPoints_oriented/mmrotate-0.2.0/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1157, in _get_bboxes_single
        scale_factor)
    RuntimeError: CUDA error: an illegal memory access was encountered

How can this be solved? I have tried different mmrotate versions and the solutions mentioned in related issues, but none of them worked. I really hope to get your help. Thank you!

shnew · Jul 11 '22

Please run python mmrotate/utils/collect_env.py to collect necessary environment information and paste it here. @LiWentomng Any suggestions?

yangxue0827 · Jul 11 '22

fatal: Not a git repository (or any parent up to mount point /data2)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla V100-PCIE-32GB
GPU 1,2,3,4,5,6,7,8,9: GeForce RTX 2080 Ti
CUDA_HOME: /data1/shenhui/cuda-10.2:/data1/s/software
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.8.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0
OpenCV: 4.6.0
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMRotate: 0.2.0+

shnew · Jul 11 '22

Please try the suggestions below:

(1) Change the image size from (1024, 1024) to (960, 960) in the test config. This can work, although it decreases performance slightly (a config sketch is given at the end of this comment).

(2) Try changing the code in the function _get_bboxes_single. Change

            poly_pred = self.points2rotrect(points_pred, y_first=True)
            bbox_pos_center = points[:, :2].repeat(1, 4)
            polys = poly_pred * self.point_strides[level_idx] + bbox_pos_center
            bboxes = poly2obb(polys, self.version)

to

            pts_pred = points_pred.reshape(-1, self.num_points, 2)
            pts_pred_offsety = pts_pred[:, :, 0::2]
            pts_pred_offsetx = pts_pred[:, :, 1::2]
            pts_pred = torch.cat([pts_pred_offsetx, pts_pred_offsety],
                                 dim=2).reshape(-1, 2 * self.num_points)

            pts_pos_center = points[:, :2].repeat(1, self.num_points)
            pts = pts_pred * self.point_strides[level_idx] + pts_pos_center

            polys = min_area_polygons(pts)
            bboxes = poly2obb(polys, self.version)

There may be a bug in the CUDA function min_area_polygons when its input values are small, so mapping the predicted point offsets to their real positions in the whole image can sometimes avoid the issue.

(3) Use a better GPU; a V100 may work better than a 2080 Ti in some cases.
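For concreteness, here is a minimal sketch of what suggestion (1) looks like in a config, assuming the standard DOTA test pipeline used by the mmrotate 0.x configs; only img_scale changes, and the surrounding keys are the usual defaults rather than values taken from this issue.

    # Sketch of suggestion (1): shrink the test-time image scale to (960, 960).
    # This mirrors the standard MultiScaleFlipAug test pipeline of the DOTA
    # configs; img_norm_cfg is defined earlier in the config file as usual.
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(960, 960),  # was (1024, 1024)
            flip=False,
            transforms=[
                dict(type='RResize'),
                dict(type='Normalize', **img_norm_cfg),
                dict(type='Pad', size_divisor=32),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img'])
            ])
    ]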

LiWentomng · Jul 11 '22

Thank you very much for your advice. Unfortunately, I've tried all of your suggestions, and they sometimes work and sometimes don't. To keep testing running, my workaround was to skip the images that cause errors. Obviously this is not a perfect solution, so I hope someone can fundamentally solve this problem. Thank you!
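For anyone who needs the same stopgap, a minimal sketch of the "skip the bad images" idea: keep a list of image names that crashed previous runs and filter them out before building the test set. BAD_IMAGES, the function name, and the file layout here are placeholders for illustration, not part of mmrotate.

    import os

    # Hypothetical stopgap: exclude images known to trigger the CUDA
    # illegal-memory-access error so the rest of the test set can still run.
    BAD_IMAGES = {'P0003__1024__0___0.png'}  # names collected from crashed runs

    def filter_test_images(img_dir, bad_names=BAD_IMAGES):
        """Return the test image names minus the known-bad ones."""
        return sorted(
            name for name in os.listdir(img_dir)
            if name.endswith('.png') and name not in bad_names)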

shnew · Jul 12 '22

I guess it is because some sub-images in DOTA 2.0 contain many objects, which causes some CUDA operators to use a lot of memory.

yangxue0827 · Jul 13 '22

This could be the reason, and it also happened when I used DOTA1.0.

shnew · Jul 13 '22

Hi, I've faced this too, any workaround? (I ran using V100 GPUs)

austinmw · Jul 22 '22

A solution that has worked: set a smaller nms_pre, for example (a note on where this setting goes follows the snippet):

test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
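For context, a sketch of where this block lives: in the mmrotate 0.x config style, test_cfg is part of the model dict, so the smaller nms_pre can be set in a derived config as below (values other than nms_pre are the ones already shown above). If your tools/test.py supports --cfg-options, overriding model.test_cfg.nms_pre on the command line should also work.

    # Sketch: place the smaller nms_pre inside the model's test_cfg.
    model = dict(
        test_cfg=dict(
            nms_pre=1000,  # reduced to limit how many boxes reach NMS
            min_bbox_size=0,
            score_thr=0.05,
            nms=dict(iou_thr=0.4),
            max_per_img=2000))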

yangxue0827 · Aug 14 '22

@yangxue0827 I still get this error even with that change:

    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))

2022-08-15 22:09:10,988 - mmrotate - INFO - Epoch [1][50/2357]  lr: 3.189e-03, eta: 2:20:54, time: 1.813, data_time: 0.095, memory: 8398, loss_cls: 0.8521, loss_pts_init: 0.3214, loss_pts_refine: 0.2961, loss_spatial_init: 0.0152, loss_spatial_refine: 0.0002, loss: 1.4849, grad_norm: 1.8358
2022-08-15 22:10:44,014 - mmrotate - INFO - Epoch [1][100/2357]  lr: 3.723e-03, eta: 2:21:14, time: 1.861, data_time: 0.016, memory: 8398, loss_cls: 0.3215, loss_pts_init: 0.3107, loss_pts_refine: 0.2996, loss_spatial_init: 0.0153, loss_spatial_refine: 0.0001, loss: 0.9473, grad_norm: 1.8034
2022-08-15 22:12:20,123 - mmrotate - INFO - Epoch [1][150/2357]  lr: 4.256e-03, eta: 2:21:52, time: 1.922, data_time: 0.017, memory: 8398, loss_cls: 0.2361, loss_pts_init: 0.3174, loss_pts_refine: 0.2976, loss_spatial_init: 0.0155, loss_spatial_refine: 0.0002, loss: 0.8668, grad_norm: 1.6680
2022-08-15 22:13:48,646 - mmrotate - INFO - Epoch [1][200/2357]  lr: 4.789e-03, eta: 2:18:32, time: 1.770, data_time: 0.018, memory: 8398, loss_cls: 0.2194, loss_pts_init: 0.3092, loss_pts_refine: 0.2999, loss_spatial_init: 0.0150, loss_spatial_refine: 0.0001, loss: 0.8437, grad_norm: 1.7740
2022-08-15 22:15:09,347 - mmrotate - INFO - Epoch [1][250/2357]  lr: 5.323e-03, eta: 2:13:37, time: 1.614, data_time: 0.018, memory: 8398, loss_cls: 0.2059, loss_pts_init: 0.3087, loss_pts_refine: 0.2911, loss_spatial_init: 0.0165, loss_spatial_refine: 0.0002, loss: 0.8223, grad_norm: 1.6533
2022-08-15 22:16:37,447 - mmrotate - INFO - Epoch [1][300/2357]  lr: 5.856e-03, eta: 2:11:42, time: 1.762, data_time: 0.016, memory: 8398, loss_cls: 0.1891, loss_pts_init: 0.3136, loss_pts_refine: 0.2940, loss_spatial_init: 0.0151, loss_spatial_refine: 0.0001, loss: 0.8120, grad_norm: 1.6122
2022-08-15 22:17:57,952 - mmrotate - INFO - Epoch [1][350/2357]  lr: 6.389e-03, eta: 2:08:20, time: 1.610, data_time: 0.016, memory: 8398, loss_cls: 0.1903, loss_pts_init: 0.3057, loss_pts_refine: 0.2905, loss_spatial_init: 0.0162, loss_spatial_refine: 0.0002, loss: 0.8028, grad_norm: 1.6764
2022-08-15 22:19:23,388 - mmrotate - INFO - Epoch [1][400/2357]  lr: 6.923e-03, eta: 2:06:22, time: 1.709, data_time: 0.017, memory: 8398, loss_cls: 0.1903, loss_pts_init: 0.3148, loss_pts_refine: 0.2962, loss_spatial_init: 0.0163, loss_spatial_refine: 0.0002, loss: 0.8178, grad_norm: 1.6627
2022-08-15 22:20:54,942 - mmrotate - INFO - Epoch [1][450/2357]  lr: 7.456e-03, eta: 2:05:29, time: 1.831, data_time: 0.017, memory: 8398, loss_cls: 0.1913, loss_pts_init: 0.3114, loss_pts_refine: 0.2970, loss_spatial_init: 0.0158, loss_spatial_refine: 0.0002, loss: 0.8157, grad_norm: 1.7099
Traceback (most recent call last):
  File "/opt/ml/code/mmrotate/tools/train.py", line 192, in <module>
    main()
  File "/opt/ml/code/mmrotate/tools/train.py", line 181, in main
    train_detector(
  File "/opt/conda/lib/python3.8/site-packages/mmrotate/apis/train.py", line 141, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 59, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/detectors/single_stage.py", line 81, in forward_train
    losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 335, in forward_train
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
  File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 952, in loss
    quality_assess_list, = multi_apply(
  File "/opt/conda/lib/python3.8/site-packages/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 480, in pointsets_quality_assessment
    sampling_pts_pred_init = self.sampling_points(
  File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 342, in sampling_points
    ratio = torch.linspace(0, 1, points_num).to(device).repeat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered

I even tried V100s configured with 32 GB of memory instead of 16 GB (AWS p3dn.24xlarge instances).
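A side note for anyone debugging this: as the message above says, the error is reported asynchronously, so the traceback line may not be the real fault site. Below is a minimal sketch of forcing a synchronous stack trace, assuming you can edit the entry script; CUDA_LAUNCH_BLOCKING is a standard PyTorch/CUDA environment variable and must be set before CUDA is initialized (exporting it in the shell before launching training works too).

    import os

    # Force synchronous CUDA kernel launches so the illegal-memory-access error
    # surfaces at the actual faulting call rather than at a later API call.
    # Set this before the first CUDA operation, e.g. at the top of tools/train.py.
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

    import torch  # noqa: E402  (imported after setting the flag on purpose)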

austinmw · Aug 15 '22

Setting a smaller nms_pre also triggers this bug. Any updates?

chunibyo-wly · Sep 01 '22

Switching the optimizer from Adam to SGD can resolve this, according to https://github.com/open-mmlab/mmrotate/issues/614#issuecomment-1333101855.
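For reference, a sketch of that change in config form; the SGD hyperparameters below are the common mmrotate 0.x single-GPU defaults, not values verified for this particular model, so treat them as a starting point.

    # Sketch of the Adam -> SGD switch in the config; lr/momentum/weight_decay
    # are the usual mmrotate single-GPU defaults and may need retuning.
    optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))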

yangxue0827 · Dec 01 '22

(Quoting LiWentomng's suggestions above: change the test image size to (960, 960), modify _get_bboxes_single to map the point offsets to image coordinates before min_area_polygons, or use a better GPU.)

It worked for me, great!

xiaolinyezi · Dec 09 '22

(Quoting austinmw's report above: the error still occurred even with nms_pre=1000 and on 32 GB V100s.)

I have the same problem. I am also on a V100: training runs fine, but the error occurs when I test. Have you solved it yet?

pphgood · Mar 20 '23

(Quoting shnew's earlier comment about the suggestions working only sometimes and skipping the images that cause errors.)

I met the same problem when evaluating on the test set. I tried changing the image size to (960, 960) and setting nms_pre=1000, but it only works sometimes. May I ask if you have solved this problem?

pphgood · Mar 20 '23

I met the same problem, and adjusting nms_pre did not help. Is there any update on this bug?

silencersai · Nov 15 '23

I met the same problem too. Can anyone help?

GisRookie · Nov 22 '23

It seems there is no feasible solution yet.

soHardToHaveAName · Mar 07 '24