CUDA error: device-side assert triggered
I am trying to train with CUDA 11.1 on 4x 2080 Ti.
Training crashed after ~600 iterations with:
[08/28 15:43:34] fvcore.common.checkpoint WARNING: The checkpoint state_dict contains keys that are not used by the model:
fc1000.{bias, weight}
[08/28 15:43:34] detectron2.engine.train_loop INFO: Starting training from iteration 0
[08/28 15:56:53] detectron2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
File "/data/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/data/detectron2/detectron2/engine/defaults.py", line 495, in run_step
self._trainer.run_step()
File "/data/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
loss_dict = self.model(data)
File "/opt/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/data/detectron2/projects/EntitySeg/entityseg/arch.py", line 103, in forward
num_instances = int(torch.max(instance_map)+1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[08/28 15:56:53] detectron2.engine.hooks INFO: Overall training speed: 653 iterations in 0:13:04 (1.2021 s / it)
[08/28 15:56:53] detectron2.engine.hooks INFO: Total training time: 0:13:05 (0:00:00 on hooks)
@qqlu I am using commit 28174e932c534f841195f02184dc67b941c65a67 with find_unused_parameters=True.
Any idea what could be the reason?
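
In case it helps narrow this down, here is a minimal debugging sketch based on the hint in the log. It rests on two assumptions: that `CUDA_LAUNCH_BLOCKING` is set before CUDA is initialized (here, before `torch` would be imported), and that the assert comes from an out-of-range index or label, which is a common cause of device-side asserts but is not confirmed for this crash. `out_of_range_labels` is a hypothetical helper, not part of detectron2 or EntitySeg.

```python
import os

# Set before the first CUDA call (safest: before importing torch) so kernels
# run synchronously and the traceback points at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def out_of_range_labels(labels, num_classes):
    """Return the sorted distinct label values outside [0, num_classes).

    `labels` is any iterable of ints, e.g. tensor.flatten().tolist().
    Hypothetical helper for illustration only.
    """
    return sorted({int(v) for v in labels if not 0 <= int(v) < num_classes})


if __name__ == "__main__":
    # Toy data: 7 and -1 are invalid for a 5-class problem.
    print(out_of_range_labels([0, 3, 7, 4, -1, 7], num_classes=5))
```

Running the same check on the CPU copy of each batch's annotations before the forward pass should show whether any value falls outside the range the model expects.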