
Problems implementing Rotated YOLOX

Open liuyanyi opened this issue 3 years ago • 2 comments

I am implementing Rotated YOLOX for MMRotate in https://github.com/open-mmlab/mmrotate/pull/409, and the SimOTA assigner raises a CUDA error during training.

Compared with mmdet, only get_in_gt_and_in_center_info and bbox_overlaps differ, in order to support rotated detection. After setting CUDA_LAUNCH_BLOCKING=1, the error log shows that the error may be caused by binary_cross_entropy. It's weird because there is no error when training with fp16. Are there any suggestions for debugging this?

Error info:

./aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [58,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "/miniconda3/lib/python3.9/site-packages/mmdet/core/bbox/assigners/sim_ota_assigner.py", line 67, in assign
    assign_result = self._assign(pred_scores, priors, decoded_bboxes,
  File "/workspace/mmrotate/mmrotate/core/bbox/assigners/r_sim_ota_assinger.py", line 85, in _assign
    F.binary_cross_entropy(
  File "/miniconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 3065, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

liuyanyi, Jul 25 '22 14:07
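
One way to narrow down an assertion like `input_val >= zero && input_val <= one` is to validate the tensor on the Python side just before it reaches F.binary_cross_entropy, so the failure is reported with counts instead of a device-side assert. The sketch below is only an illustration: check_bce_input is a hypothetical helper, not part of mmdet or MMRotate, and where to call it inside _assign is an assumption.

```python
import torch


def check_bce_input(scores: torch.Tensor, name: str = "pred_scores") -> None:
    """Raise a readable error if `scores` is not a valid BCE input.

    Hypothetical debugging helper, not an mmdet/MMRotate API.
    """
    finite = torch.isfinite(scores)
    # NaN compares False against any bound, so it is caught here as well.
    in_range = (scores >= 0) & (scores <= 1)
    bad = ~(finite & in_range)
    if bad.any():
        num_bad = int(bad.sum())
        num_nan = int(torch.isnan(scores).sum())
        num_inf = int(torch.isinf(scores).sum())
        raise ValueError(
            f"{name}: {num_bad} invalid value(s) "
            f"({num_nan} NaN, {num_inf} Inf, "
            f"{num_bad - num_nan - num_inf} finite but outside [0, 1])"
        )


# Usage: run the check on the scores that are about to be passed to the loss.
scores = torch.tensor([0.2, 0.9, float("nan")])
try:
    check_bce_input(scores)
except ValueError as err:
    print(err)
```

If the check reports NaN values, the problem is probably upstream of the assigner (in the network output) rather than in the loss call itself.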

I found that the network output becomes NaN, so the BCE loss in SimOTA gets NaN input and triggers the error. Lowering the lr or adding grad clip may fix that; I'll run some experiments to figure it out.

liuyanyi, Jul 27 '22 09:07
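
For reference, the two mitigations mentioned above (a lower lr and gradient clipping) could look roughly like this in an mmdet 2.x / MMRotate 0.x style config. This is a sketch under assumptions: the numeric values are illustrative, not settings taken from the pull request, and the optimizer override assumes the config inherits its other optimizer keys from a base file.

```python
# Illustrative overrides for an mmdet 2.x / MMRotate 0.x style config file.
# The numbers are assumptions chosen for the example, not tuned values.

# Clip the global gradient norm before each optimizer step.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

# Lower the learning rate; assumes the file inherits the remaining optimizer
# keys (type, momentum, ...) from a base config listed in `_base_`.
optimizer = dict(lr=0.001)
```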

The following error occurred when I used your YOLOX. I only changed img_scale and num_classes:

Traceback (most recent call last):
  File "E:\lrk\trail\code\mmrotate-ryolox\tools\train.py", line 196, in <module>
    main()
  File "E:\lrk\trail\code\mmrotate-ryolox\tools\train.py", line 185, in main
    train_detector(
  File "E:\lrk\trail\code\mmrotate-ryolox\mmrotate\apis\train.py", line 141, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 49, in train
    for i, data_batch in enumerate(self.data_loader):
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\utils\data\dataloader.py", line 517, in __next__
    data = self._next_data()
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\utils\data\dataloader.py", line 1179, in _next_data
    return self._process_data(data)
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\utils\data\dataloader.py", line 1225, in _process_data
    data.reraise()
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\utils\data\_utils\worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\mmdet\datasets\dataset_wrappers.py", line 431, in __getitem__
    updated_results = transform(copy.deepcopy(results))
  File "D:\lrk\local\anaconda3\anaconda\envs\mmrotate\lib\site-packages\mmdet\datasets\pipelines\transforms.py", line 2326, in __call__
    results = self._mixup_transform(results)
  File "E:\lrk\trail\code\mmrotate-ryolox\mmrotate\datasets\pipelines\transforms.py", line 921, in _mixup_transform
    mixup_gt_bboxes = poly2obb(
  File "E:\lrk\trail\code\mmrotate-ryolox\mmrotate\core\bbox\transforms.py", line 110, in poly2obb
    results = poly2obb_le90(polys)
  File "E:\lrk\trail\code\mmrotate-ryolox\mmrotate\core\bbox\transforms.py", line 329, in poly2obb_le90
    width, _ = torch.max(edges, 1)
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity

Process finished with exit code 1

19990101lrk, Dec 08 '22 03:12
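
The last frames of that traceback suggest that poly2obb_le90 received a tensor containing no polygons, possibly because the image picked for mixup had no boxes left. A possible workaround, sketched below, is to skip the conversion when the input is empty. convert_if_nonempty is a hypothetical wrapper, not an MMRotate function, and the (0, 5) output shape assumes the usual (cx, cy, w, h, angle) rotated box layout.

```python
import torch


def convert_if_nonempty(polys: torch.Tensor, convert_fn) -> torch.Tensor:
    """Call `convert_fn` only when `polys` holds at least one polygon.

    Hypothetical guard; `convert_fn` stands in for a polygon-to-OBB
    conversion such as the poly2obb call seen in the traceback.
    """
    if polys.numel() == 0:
        # Assumed layout: one row per box, columns (cx, cy, w, h, angle).
        return polys.new_zeros((0, 5))
    return convert_fn(polys)


# Usage with a stand-in conversion that only mimics the output shape.
def fake_convert(p: torch.Tensor) -> torch.Tensor:
    return p.new_ones((p.shape[0], 5))


empty_polys = torch.zeros((0, 8))
print(convert_if_nonempty(empty_polys, fake_convert).shape)
```

Whether an empty result is handled correctly by the rest of the mixup transform would still need to be checked against the actual pipeline code.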