When I train the model, at epoch 128 it throws the error below:
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [127,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
File "./tools/train.py", line 145, in
main()
File "./tools/train.py", line 141, in main
meta=meta)
File "/home/luyipeng/EfficientLPS/mmdet/apis/train.py", line 102, in train_detector
meta=meta)
File "/home/luyipeng/EfficientLPS/mmdet/apis/train.py", line 182, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 384, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 283, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/luyipeng/EfficientLPS/mmdet/apis/train.py", line 75, in batch_processor
losses = model(**data)
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/luyipeng/EfficientLPS/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/home/luyipeng/EfficientLPS/mmdet/models/efficientlps/base.py", line 145, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/luyipeng/EfficientLPS/mmdet/models/efficientlps/efficientLPS.py", line 205, in forward_train
semantic_logits = self.semantic_head(x[:4], x_range[:4])
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/luyipeng/EfficientLPS/mmdet/models/mask_heads/efficientlps_semantic_head.py", line 312, in forward
feats[idx] = lateral_conv_ss(feats[idx], r_off)
File "/home/luyipeng/miniconda3/envs/efficientLPS_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/luyipeng/EfficientLPS/mmdet/models/mask_heads/efficientlps_semantic_head.py", line 224, in forward
x_b = shift_x(x_u, x_range, DDPC_max, self.range_out_b)
File "/home/luyipeng/EfficientLPS/mmdet/models/mask_heads/efficientlps_semantic_head.py", line 39, in shift_x
x_off = torch.from_numpy(np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1])).cuda(x.device)
RuntimeError: CUDA error: device-side assert triggered
I'm confused about the cause of this error and don't know how to fix it. Please help.
The error occurs in "offset_y = D_max * ((x_range_y - x_range_y.min()) / (x_range_y.max() - x_range_y.min()))" when x_range_y.max() equals x_range_y.min(), but I still don't know how to handle that case.
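When the denominator is zero, that division produces NaN/inf offsets, which later turn into out-of-bounds indices in the scatter/gather kernel, matching the assertion at the top of the trace. A minimal sketch of one possible guard, assuming a small epsilon clamp on the denominator is acceptable here (the helper name and the eps value are my own for illustration, not part of EfficientLPS):

import torch

def safe_normalized_offset(x_range_y, D_max):
    # Same expression as in efficientlps_semantic_head.py, but with the
    # denominator clamped so that max == min can no longer divide by zero
    # and produce NaN/inf offsets (which become invalid gather indices).
    rng_min = x_range_y.min()
    rng_max = x_range_y.max()
    denom = (rng_max - rng_min).clamp(min=1e-6)  # eps value is an assumption
    return D_max * ((x_range_y - rng_min) / denom)

Replacing the original expression with something like offset_y = safe_normalized_offset(x_range_y, D_max) should keep the offsets finite when the range values happen to be constant; whether that is the right semantic behaviour for EfficientLPS I can't say.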
I have the same issue and maybe this will help you:
I trained three times, on different computers with GPUs of different capacities. Each time I got the same error at the same iteration; restarting gives the same result. Why this happens is still a mystery to me.
- Nvidia Quadro K1200: error at ~iteration 45.
- Nvidia GeForce RTX 2060: error at ~iteration 90.
- 2x Nvidia GeForce RTX 3090: success with 160 iterations.
I hope this gives you some more information, even though it doesn't explain the error.