
RuntimeError: CUDA error: device-side assert triggered

Open Wang-hui-001 opened this issue 9 months ago • 5 comments

Hello, I am very interested in your work and am training with the method you provided. However, the following error has now occurred in two runs, at checkpoints 513 and 517. Could you suggest a solution?

checkpoint at 517 epochs
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [131,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "./tools/train.py", line 135, in <module>
    main()
  File "./tools/train.py", line 131, in main
    runner.train()
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 346, in _run_forward
    results = self(**data, mode=mode)
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wanghui/DownLoads/unidet3d-master/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
    return self.loss(inputs, data_samples, **kwargs)
  File "/home/wanghui/DownLoads/unidet3d-master/unidet3d/unidet3d.py", line 315, in loss
    self.get_bboxes_by_masks(gt_masks.T,
  File "/home/wanghui/DownLoads/unidet3d-master/unidet3d/unidet3d.py", line 241, in get_bboxes_by_masks
    object_points = points[mask]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Wang-hui-001, Mar 22 '25 10:03
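Editor's note: the log itself warns that CUDA errors are reported asynchronously, so the `points[mask]` line in the traceback may not be the call that actually went out of bounds (the assertion comes from a scatter/gather kernel). A minimal sketch of re-running the same script with `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests, is shown below. The helper names `blocking_env` and `rerun_training` are hypothetical and not part of the repository; only the `./tools/train.py` path is taken from the traceback, and the arguments are whatever was used for the original run.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    Hypothetical helper: with CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks
    until it finishes, so the Python traceback should point at the failing call.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_training(train_args):
    """Re-run ./tools/train.py (path from the traceback) in a child process.

    train_args is a placeholder for the arguments of the original run.
    The variable is set for the child process before CUDA is initialised there.
    """
    cmd = [sys.executable, "./tools/train.py", *train_args]
    return subprocess.run(cmd, env=blocking_env(), check=False)


# Example call (not executed here); pass the same arguments as the failing run:
# rerun_training(sys.argv[1:])
```

Training is noticeably slower in this mode, so it is probably best used only to locate the failing operation and then switched off again.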

Hard to say anything specific... Are you training on all 6 of our datasets, downloaded from Hugging Face? Do all Python package versions match our Dockerfile?

filaPro, Mar 22 '25 14:03

Thank you for your reply. I am only training on the ScanNet dataset. I solved the problem by progressively reducing the batch size. However, my current question is: why does the testing procedure in the README use the checkpoint from epoch 1024, when my results are actually better at epoch 1010?

Wang-hui-001, Mar 25 '25 02:03

I believe training is quite noisy, and the difference between the last checkpoints is within statistical error.

filaPro, Mar 25 '25 13:03

Thank you for your answer. I would also like to ask if you have implemented visualization using Open3D. If so, could you provide the source code?

Wang-hui-001, Mar 26 '25 02:03

I think our visualization produces .obj files, which we recommend loading in MeshLab. But you can also load them in Open3D.

filaPro, Mar 26 '25 14:03
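Editor's note: a minimal sketch of the Open3D route mentioned above, assuming the exported .obj files sit together in one directory. The function names and the directory argument are hypothetical, and the layout of the exported files is not confirmed by the thread, so the code falls back to a point cloud when a file has vertices but no faces. Open3D is imported lazily, so the file-listing helper works without it.

```python
from pathlib import Path


def find_obj_files(out_dir):
    """Return the .obj files directly inside out_dir, sorted by name."""
    return sorted(Path(out_dir).glob("*.obj"))


def show_obj_files(out_dir):
    """Load every .obj file in out_dir with Open3D and open a viewer window."""
    import open3d as o3d  # third-party dependency, imported only when needed

    geometries = []
    for path in find_obj_files(out_dir):
        mesh = o3d.io.read_triangle_mesh(str(path))
        if not mesh.has_vertices():
            continue  # nothing readable in this file
        if mesh.has_triangles():
            mesh.compute_vertex_normals()  # needed for shaded rendering
            geometries.append(mesh)
        else:
            # Vertices without faces: display them as a point cloud instead.
            cloud = o3d.geometry.PointCloud(mesh.vertices)
            if mesh.has_vertex_colors():
                cloud.colors = mesh.vertex_colors
            geometries.append(cloud)
    if geometries:
        o3d.visualization.draw_geometries(geometries)


# Example call (not executed here); the directory name is a placeholder:
# show_obj_files("path/to/visualization/output")
```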