CUDNN_STATUS_NOT_INITIALIZED with mgpu

Open uzdry opened this issue 5 years ago • 2 comments

Hi,

I really like PointRCNN and wanted to try some things out with it.

When I try to train the RPN with multiple GPUs, just the way you describe in the README, I get the following error.

Some more information: CUDA version: 10, cuDNN version: 10, GPUs: 2x 980 Ti (non-SLI), OS: Pop!_OS 19.04

I tried running PointRCNN on each GPU separately and both worked, so the GPUs themselves should be fine.

It also works if I use a batch size of just 1, probably because it then uses only one GPU. I also tried various other batch sizes.

This is the error I get when running with --mgpu:

Traceback (most recent call last):
  File "train_rcnn.py", line 250, in <module>
    lr_scheduler_each_iter=(cfg.TRAIN.OPTIMIZER == 'adam_onecycle')
  File "/-redacted-/PointRCNN/tools/../tools/train_utils/train_utils.py", line 199, in train
    loss, tb_dict, disp_dict = self._train_it(batch)
  File "/-redacted-/PointRCNN/tools/../tools/train_utils/train_utils.py", line 132, in _train_it
    loss, tb_dict, disp_dict = self.model_fn(self.model, batch)
  File "/-redacted-/PointRCNN/tools/../lib/net/train_functions.py", line 35, in model_fn
    ret_dict = model(input_data)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../lib/net/point_rcnn.py", line 33, in forward
    rpn_output = self.rpn(input_data)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../lib/net/rpn.py", line 74, in forward
    backbone_xyz, backbone_features = self.backbone_net(pts_input)  # (B, N, 3), (B, C, N)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../lib/net/pointnet2_msg.py", line 61, in forward
    li_xyz, li_features = self.SA_modules[i](l_xyz[i], l_features[i])
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../pointnet2_lib/pointnet2/pointnet2_modules.py", line 40, in forward
    new_features = self.mlps[i](new_features)  # (B, mlp[-1], npoint, nsample)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

uzdry avatar May 23 '19 13:05 uzdry

Sorry, I have no idea what causes this error, since I could not reproduce it. Maybe you could first write a simple multi-GPU script to test your environment.
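
A minimal sketch of such a test, assuming only that PyTorch is installed. The function name `check_multi_gpu` and the tensor sizes are made up for illustration and are not part of PointRCNN. It runs a small 1x1 `Conv2d` (the layer type where the traceback ends) on each visible GPU on its own, and then once more through `nn.DataParallel`, which is the wrapper that appears in the traceback. If the first step passes and the second one fails, the problem is likely in the multi-GPU setup of the environment rather than in PointRCNN itself.

```python
import torch
import torch.nn as nn


def check_multi_gpu(batch_size=4):
    """Run a tiny Conv2d forward pass on each GPU, then through nn.DataParallel.

    Returns a dict mapping a step name to the output shape of that step.
    The dict is empty when no CUDA device is visible.
    """
    n_gpus = torch.cuda.device_count()
    print(f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
          f"cuDNN {torch.backends.cudnn.version()}, visible GPUs: {n_gpus}")

    results = {}
    if n_gpus == 0:
        return results  # nothing to test on a CPU-only machine

    # Step 1: a convolution on every device individually.
    for i in range(n_gpus):
        device = torch.device(f"cuda:{i}")
        conv = nn.Conv2d(3, 8, kernel_size=1).to(device)
        x = torch.randn(batch_size, 3, 16, 16, device=device)
        results[f"cuda:{i}"] = tuple(conv(x).shape)

    # Step 2: the same kind of layer replicated across all GPUs.
    # The batch is large enough that every replica receives some samples.
    if n_gpus > 1:
        model = nn.DataParallel(nn.Conv2d(3, 8, kernel_size=1).cuda())
        x = torch.randn(batch_size * n_gpus, 3, 16, 16).cuda()
        results["data_parallel"] = tuple(model(x).shape)

    return results


if __name__ == "__main__":
    for step, shape in check_multi_gpu().items():
        print(step, "OK, output shape:", shape)
```

Running it with `CUDA_VISIBLE_DEVICES` set to different device subsets can help narrow down whether one particular device or only the combination of both triggers the error.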

sshaoshuai avatar May 24 '19 06:05 sshaoshuai

So, have you finally solved this problem?

supercpy avatar Nov 08 '20 07:11 supercpy