Hello, I am very interested in your work, and I have been training with the method you provided. However, the following error has now occurred twice, at checkpoints 513 and 517. Could you please suggest a solution?
Checkpoint at epoch 517:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [131,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
File "./tools/train.py", line 135, in <module>
main()
File "./tools/train.py", line 131, in main
runner.train()
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
self.run_iter(idx, data_batch)
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
outputs = self.runner.model.train_step(
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
losses = self._run_forward(data, mode='loss') # type: ignore
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 346, in _run_forward
results = self(**data, mode=mode)
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wanghui/anaconda3/envs/unidet3d-1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wanghui/DownLoads/unidet3d-master/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/home/wanghui/DownLoads/unidet3d-master/unidet3d/unidet3d.py", line 315, in loss
self.get_bboxes_by_masks(gt_masks.T,
File "/home/wanghui/DownLoads/unidet3d-master/unidet3d/unidet3d.py", line 241, in get_bboxes_by_masks
object_points = points[mask]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
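As the message suggests, setting CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized makes kernel launches synchronous, so the traceback points at the real failing call. Below is a minimal debugging sketch, assuming the failure comes from the `points[mask]` indexing in `get_bboxes_by_masks`; `check_mask_bounds` is a hypothetical helper for illustration, not part of the repository:
```python
import os

# Make CUDA kernel launches synchronous so the device-side assert is
# reported at the real call site instead of a later API call.
# This must be set before torch initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def check_mask_bounds(points: torch.Tensor, mask: torch.Tensor) -> None:
    """Hypothetical sanity check for `object_points = points[mask]`:
    verify the mask is valid for `points` before indexing on the GPU."""
    if mask.numel() == 0:
        return
    if mask.dtype == torch.bool:
        # Boolean mask: its length must match the number of points.
        assert mask.shape[0] == points.shape[0], (
            f"mask length {mask.shape[0]} != number of points {points.shape[0]}")
    else:
        # Integer index mask: every index must lie inside [0, num_points).
        lo, hi = int(mask.min()), int(mask.max())
        assert 0 <= lo and hi < points.shape[0], (
            f"index range [{lo}, {hi}] out of bounds for {points.shape[0]} points")
```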
Hard to say anything specific... Are you training on all 6 of our datasets, downloaded from Hugging Face? Do all Python package versions match our Dockerfile?
Thank you for your reply. I am training only on the ScanNet dataset, and I solved the problem by reducing the batch size. My current question is: why does the testing procedure in the README use the checkpoint from epoch 1024, when my results are actually better at epoch 1010?
I believe training is quite noisy, and the difference between the last checkpoints is within statistical error.
Thank you for your answer. I would also like to ask if you have implemented visualization using Open3D. If so, could you provide the source code?
I think our visualization produces .obj files, which we recommend loading in MeshLab. But you can also load them in Open3D.
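For reference, a minimal sketch of loading one of those .obj files in Open3D; the file path below is a placeholder:
```python
import open3d as o3d

# Placeholder path to one of the exported .obj files.
mesh = o3d.io.read_triangle_mesh("work_dirs/vis/scene0000_00.obj")
mesh.compute_vertex_normals()  # normals are needed for shaded rendering
o3d.visualization.draw_geometries([mesh])
```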