CUDA error: device-side assert triggered

Open ghost opened this issue 4 years ago • 1 comments

I am trying to train with CUDA 11.1 on 4x 2080 Ti GPUs.

Training crashed after ~650 iterations with the following error:

[08/28 15:43:34] fvcore.common.checkpoint WARNING: The checkpoint state_dict contains keys that are not used by the model:
  fc1000.{bias, weight}
[08/28 15:43:34] detectron2.engine.train_loop INFO: Starting training from iteration 0
[08/28 15:56:53] detectron2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/data/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/data/detectron2/detectron2/engine/defaults.py", line 495, in run_step
    self._trainer.run_step()
  File "/data/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/opt/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/detectron2/projects/EntitySeg/entityseg/arch.py", line 103, in forward
    num_instances = int(torch.max(instance_map)+1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[08/28 15:56:53] detectron2.engine.hooks INFO: Overall training speed: 653 iterations in 0:13:04 (1.2021 s / it)
[08/28 15:56:53] detectron2.engine.hooks INFO: Total training time: 0:13:05 (0:00:00 on hooks)

ghost avatar Aug 29 '21 07:08 ghost

@qqlu I am using commit 28174e932c534f841195f02184dc67b941c65a67 with find_unused_parameters=True. Any idea what could be the reason?

ghost avatar Aug 30 '21 12:08 ghost
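
As the log itself notes, CUDA errors are reported asynchronously, so the `torch.max(instance_map)` line in the traceback may not be the operation that actually failed. A minimal debugging sketch, not a confirmed diagnosis: it enables synchronous kernel launches as the error message suggests, and adds a CPU-side range check, on the assumption that the assert comes from an out-of-range index or label (a common cause of device-side asserts). `check_label_range` and `num_classes` are hypothetical names for illustration, not part of detectron2 or EntitySeg.

```python
import os

# Must be set before CUDA is initialised (simplest: before importing torch),
# so kernels run synchronously and the traceback points at the failing op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_label_range(labels, num_classes):
    """Hypothetical helper: raise a readable error if any label is out of range.

    `labels` is any flat iterable of ints, for example
    `tensor.flatten().tolist()` for a tensor already moved to the CPU.
    """
    bad = sorted({int(v) for v in labels if not 0 <= int(v) < num_classes})
    if bad:
        raise ValueError(f"labels outside [0, {num_classes}): {bad}")
    return True


if __name__ == "__main__":
    # In-range labels pass silently.
    check_label_range([0, 1, 2], num_classes=3)
    # An out-of-range label is reported on the CPU with a clear message,
    # instead of surfacing later as an opaque device-side assert.
    try:
        check_label_range([0, 3, 7], num_classes=3)
    except ValueError as err:
        print(err)
```

The same variable can also be set in the shell before the training command. Running the failing batch once with the model on the CPU is another way to get a readable error, since CPU indexing raises a normal Python exception.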