OpenPCDet

I can train the model, but when it comes to testing or evaluating it, I encounter the error message: 'RuntimeError: CUDA error: no kernel image is available for execution on the device.'

Open cgoldbird opened this issue 1 year ago • 3 comments

I am running on a Tesla P100. Environment:

  • torch 1.8.1+cu111
  • python 3.8
  • cuda 11.1

I encountered an error when running train.py. Training appears to complete successfully, but when the script moves on to the evaluation phase, the following error occurs: 'RuntimeError: CUDA error: no kernel image is available for execution on the device.'

Error!
Traceback (most recent call last):
  File "test.py", line 210, in <module>
    main()
  File "test.py", line 206, in main
    eval_single_ckpt(model, test_loader, args, eval_output_dir, logger, epoch_id, dist_test=dist_test)
  File "test.py", line 65, in eval_single_ckpt
    eval_utils.eval_one_epoch(
  File "/opt/data/private/cyl/OpenPCDet/tools/eval_utils/eval_utils.py", line 65, in eval_one_epoch
    pred_dicts, ret_dict = model(batch_dict)
  File "/root/miniconda3/envs/pcdet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/data/private/cyl/OpenPCDet/tools/../pcdet/models/detectors/pointpillar.py", line 21, in forward
    pred_dicts, recall_dicts = self.post_processing(batch_dict)
  File "/opt/data/private/cyl/OpenPCDet/tools/../pcdet/models/detectors/detector3d_template.py", line 271, in post_processing
    recall_dict = self.generate_recall_record(
  File "/opt/data/private/cyl/OpenPCDet/tools/../pcdet/models/detectors/detector3d_template.py", line 308, in generate_recall_record
    iou3d_rcnn = iou3d_nms_utils.boxes_iou3d_gpu(box_preds[:, 0:7], cur_gt[:, 0:7])
  File "/opt/data/private/cyl/OpenPCDet/tools/../pcdet/ops/iou3d_nms/iou3d_nms_utils.py", line 69, in boxes_iou3d_gpu
    max_of_min = torch.max(boxes_a_height_min, boxes_b_height_min)
RuntimeError: CUDA error: no kernel image is available for execution on the device
eval:   0%|                     

I have verified that PyTorch can access CUDA by running the following test:

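import sys
import torch

print('A', sys.version)                     # Python version
print('B', torch.__version__)               # PyTorch version (1.8.1+cu111 here)
print('C', torch.cuda.is_available())       # True if PyTorch can see a CUDA device
print('D', torch.backends.cudnn.enabled)    # whether cuDNN is enabled
device = torch.device('cuda')
print('E', torch.cuda.get_device_properties(device))  # name, compute capability, memory
print('F', torch.tensor([1.0, 2.0]).cuda())           # place a small tensor on the GPU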

Is this happening because the Tesla P100 is not suitable for this project? Is there a solution for this situation? Thank you all for your help.
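One further check that may help here, as a sketch rather than a confirmed diagnosis: "no kernel image is available" generally means that some compiled CUDA binary contains no code for the GPU's compute capability (6.0 on the Tesla P100). The helper below compares a device capability against a list of architecture tags using a simplified version of CUDA's compatibility rule. The function names `parse_arch` and `device_is_covered` are hypothetical. Note that `torch.cuda.get_arch_list()` only describes the PyTorch build itself, not OpenPCDet's own compiled ops such as iou3d_nms, and that CUDA errors can be reported asynchronously, so the `torch.max` line in the traceback is not necessarily the call that failed.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_60' or 'compute_80' into (kind, major, minor)."""
    match = re.fullmatch(r"(sm|compute)_(\d{2,})[a-z]?", tag)
    if match is None:
        raise ValueError(f"unrecognised architecture tag: {tag!r}")
    kind, digits = match.groups()
    # The last digit is the minor version, everything before it the major.
    return kind, int(digits[:-1]), int(digits[-1])


def device_is_covered(capability, arch_list):
    """Return True if a binary built for arch_list should run on `capability`.

    Simplified rule (an assumption of this sketch): an sm_XY binary runs on
    devices with the same major version and an equal or higher minor version;
    compute_XY (PTX) can be JIT-compiled for any equal or higher capability.
    """
    major, minor = capability
    for tag in arch_list:
        kind, a_major, a_minor = parse_arch(tag)
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()  # architectures of the PyTorch build only
        print("device capability:", cap)
        print("PyTorch built for:", archs)
        print("covered by PyTorch build:", device_is_covered(cap, archs))
```

If the PyTorch build covers the device, the remaining suspects are the extensions compiled when OpenPCDet was installed. Their embedded architectures are not visible from this script; a CUDA toolkit utility such as cuobjdump can inspect the compiled library files.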

cgoldbird avatar Apr 29 '24 08:04 cgoldbird

When I replaced the GPUs in the cluster with 2080 Ti cards, the test ran smoothly. As it stands, the issue seems to be that the Tesla P100 cannot run the test. (This is the most drastic fix, but at least it solved the problem.)

cgoldbird avatar May 06 '24 08:05 cgoldbird

Did you set the TORCH_CUDA_ARCH_LIST manually?

I also got this error message when I hadn't installed spconv properly. I am currently running it on an RTX 5000.
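For context on the TORCH_CUDA_ARCH_LIST question: PyTorch's extension builder (`torch.utils.cpp_extension`) reads this environment variable at build time to decide which GPU architectures to compile for. When it is unset, the builder picks architectures from the GPUs visible on the build machine, so an extension built on one GPU model may lack kernels for another. The sketch below only formats the value; the helper name `arch_list_value` is hypothetical, and the capabilities used (6.0 for the Tesla P100, 7.5 for the 2080 Ti) are the published values for those cards.

```python
def arch_list_value(capabilities, with_ptx=False):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST string.

    with_ptx appends '+PTX' to the highest architecture so that newer GPUs
    can JIT-compile the kernels.
    """
    ordered = sorted(set(capabilities))
    if not ordered:
        raise ValueError("at least one compute capability is required")
    entries = [f"{major}.{minor}" for major, minor in ordered]
    if with_ptx:
        entries[-1] += "+PTX"
    return " ".join(entries)


if __name__ == "__main__":
    # Tesla P100 (6.0) and RTX 2080 Ti (7.5) in one build.
    print(arch_list_value([(7, 5), (6, 0)]))
```

The resulting string would have to be exported as TORCH_CUDA_ARCH_LIST in the shell before the CUDA extensions are rebuilt (for OpenPCDet, presumably by rerunning its setup.py install step after removing the old build artifacts). Whether that resolves this particular report is not confirmed anywhere in the thread.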

Mike-7777777 avatar May 20 '24 02:05 Mike-7777777

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Jun 20 '24 01:06 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jul 04 '24 01:07 github-actions[bot]