
RuntimeError: CUDA error: device-side assert triggered (can't train the model if batch is not 1)

Open forever208 opened this issue 3 years ago • 4 comments

It seems that batch_size can only be 1. When I set batch_size = 4 or 8 during training, the following error occurs:

Traceback (most recent call last):
  File "trainval_net.py", line 321, in <module>
    rois_label, loss_list = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, box_info)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/Hand-Object-Interaction-detection/lib/model/faster_rcnn/faster_rcnn.py", line 62, in forward
    roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes, box_info)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 52, in forward
    rois_per_image, self._num_classes, box_info)
  File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 146, in _sample_rois_pytorch
    fg_inds = torch.nonzero(max_overlaps[i] >= cfg.TRAIN.FG_THRESH).view(-1)
RuntimeError: CUDA error: device-side assert triggered

forever208, May 13 '21 16:05

Hey @ddshan, have you ever trained the network with batch_size = 4 or any other value greater than 1?

forever208, May 13 '21 17:05

I turned off CUDA and ran on the CPU, which exposes the actual error:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
Traceback (most recent call last):
  File "trainval_net.py", line 321, in <module>
    rois_label, loss_list = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, box_info)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/Hand-Object-Interaction-detection/lib/model/faster_rcnn/faster_rcnn.py", line 62, in forward
    roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes, box_info)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 52, in forward
    rois_per_image, self._num_classes, box_info)
  File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 136, in _sample_rois_pytorch
    list_box.append(box_info[i][(offset[i,:].view(-1),)])
IndexError: index 20 is out of bounds for dimension 0 with size 20
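
For anyone hitting the same opaque "device-side assert triggered" message: CUDA kernels run asynchronously, so the Python line named in the GPU traceback is often not the one that failed. Besides falling back to the CPU as above, the `CUDA_LAUNCH_BLOCKING` environment variable makes kernel launches synchronous. A minimal sketch follows; the helper name `configure_debug_env` is hypothetical, and it assumes it is called before CUDA is initialized (simplest: before `import torch`).

```python
import os


def configure_debug_env(force_cpu: bool = False) -> None:
    """Hypothetical helper: call at the top of the script, before importing torch."""
    # Synchronous kernel launches, so the traceback points at the failing op.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    if force_cpu:
        # Hide all GPUs from the process. The training script must then
        # also be started without its CUDA option, or it will fail early.
        os.environ["CUDA_VISIBLE_DEVICES"] = ""


configure_debug_env(force_cpu=False)
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Either route should turn the generic assert into a readable error such as the IndexError above.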

forever208, May 13 '21 19:05

I am pretty sure the problem is in proposal_target_layer_cascade.py, around line 170:

    labels = gt_boxes[:, :, 4].contiguous().view(-1)[(offset.view(-1),)].view(batch_size, -1)
    list_box = []
    for i in range(batch_size):
        # Error when batch > 1: IndexError: index 20 is out of bounds for dimension 0 with size 20
        list_box.append(box_info[i][(offset[i, :].view(-1),)])
    boxes_info = torch.stack(list_box)
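
A possible explanation, not verified against the repository: the `labels` line indexes a flattened view with `offset`, which suggests that `offset` already includes a per-image shift of `i * num_gt` (as in the upstream Faster R-CNN code this project appears to follow). Using that same shifted index on the per-image slice `box_info[i]` would go out of bounds for every image after the first, which matches "index 20 ... with size 20". The sketch below reproduces the arithmetic with plain Python lists rather than tensors; `gt_assignment`, `flat_offsets`, and `gather_per_image` are illustrative names, not identifiers confirmed by the snippet above.

```python
def flat_offsets(gt_assignment, num_gt):
    """Add i * num_gt to row i, giving indices into a flattened (batch * num_gt) array."""
    return [[i * num_gt + a for a in row] for i, row in enumerate(gt_assignment)]


def gather_per_image(box_info, gt_assignment):
    """Index each image's own box_info with its local assignment (no batch shift)."""
    return [[box_info[i][a] for a in row] for i, row in enumerate(gt_assignment)]


num_gt, batch_size = 20, 2
# Placeholder data: box_info[i][k] is the tuple (image index, box index).
box_info = [[(i, k) for k in range(num_gt)] for i in range(batch_size)]
# Assumed meaning: per-image index of the best-matching GT box for each ROI.
gt_assignment = [[0, 3], [0, 5]]

offset = flat_offsets(gt_assignment, num_gt)

# Shifted offsets are valid against the flattened array (the `labels` pattern).
flat = [box for image in box_info for box in image]
via_flat = [[flat[o] for o in row] for row in offset]

# The same offsets fail against a single image's slice once i >= 1.
try:
    box_info[1][offset[1][0]]
    failed = False
except IndexError:
    failed = True

via_local = gather_per_image(box_info, gt_assignment)
print(offset, failed, via_flat == via_local)
```

If this reading is right, indexing `box_info[i]` with the unshifted per-image assignment, or flattening `box_info` before applying `offset`, would avoid the IndexError. Other parts of the code may still assume a batch size of 1.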

forever208, May 16 '21 21:05

Hi,

We only trained with batch size = 1 because of constraints introduced by our modifications to the codebase we built on. Sorry for the inconvenience. We will let you know if we release an improved version.

ddshan, Nov 12 '21 18:11