PointRCNN
CUDNN_STATUS_NOT_INITIALIZED with mgpu
Hi,
I really like PointRCNN and wanted to try some things out with it.
When I try to train the RPN with multiple GPUs, just the way you describe in the README, I get the following error.
Some more information: CUDA version: 10, cuDNN version: 10, GPUs: 2x 980 Ti (non-SLI), OS: Pop!_OS 19.04
I tried running PointRCNN on each GPU separately and both worked, so the GPUs themselves should be fine.
It also works if I use a batch size of just 1, probably because it then only uses one GPU. I also tried various other batch sizes, and they all fail.
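A rough sketch of why a batch size of 1 would end up on a single GPU. It assumes `nn.DataParallel` splits the batch along the first dimension the same way `torch.chunk` does; `split_batch` is a hypothetical helper written for illustration, not part of PyTorch or PointRCNN.

```python
import math


def split_batch(batch_size, n_gpus):
    """Per-GPU batch sizes, assuming torch.chunk-style splitting."""
    # Each chunk holds ceil(batch_size / n_gpus) samples, except possibly the last.
    chunk = math.ceil(batch_size / n_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes


if __name__ == "__main__":
    for b in (1, 2, 3, 16):
        # A list with a single entry means only one GPU receives data.
        print(b, split_batch(b, 2))
```

Under that assumption, a batch of 1 yields a single chunk, so the second GPU never runs a replica, which would explain why the error only shows up for larger batches.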
This is the error I get when running with --mgpu:
Traceback (most recent call last):
  File "train_rcnn.py", line 250, in <module>
    lr_scheduler_each_iter=(cfg.TRAIN.OPTIMIZER == 'adam_onecycle')
  File "/-redacted-/PointRCNN/tools/../tools/train_utils/train_utils.py", line 199, in train
    loss, tb_dict, disp_dict = self._train_it(batch)
  File "/-redacted-/PointRCNN/tools/../tools/train_utils/train_utils.py", line 132, in _train_it
    loss, tb_dict, disp_dict = self.model_fn(self.model, batch)
  File "/-redacted-/PointRCNN/tools/../lib/net/train_functions.py", line 35, in model_fn
    ret_dict = model(input_data)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../lib/net/point_rcnn.py", line 33, in forward
    rpn_output = self.rpn(input_data)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../lib/net/rpn.py", line 74, in forward
    backbone_xyz, backbone_features = self.backbone_net(pts_input) # (B, N, 3), (B, C, N)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../lib/net/pointnet2_msg.py", line 61, in forward
    li_xyz, li_features = self.SA_modules[i](l_xyz[i], l_features[i])
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/PointRCNN/tools/../pointnet2_lib/pointnet2/pointnet2_modules.py", line 40, in forward
    new_features = self.mlps[i](new_features) # (B, mlp[-1], npoint, nsample)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/-redacted-/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Sorry, I have no idea what causes this error, since I could not reproduce it. Maybe you could first write a simple multi-GPU script to test your environment.
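A standalone check along those lines might look like the sketch below. It is not part of PointRCNN: `check_multi_gpu` is a hypothetical helper, and the tiny conv net is only a stand-in chosen because `Conv2d` goes through cuDNN, the same path that fails in the traceback. It runs one forward and backward pass under `nn.DataParallel`.

```python
# Minimal multi-GPU sanity check; a sketch, not part of PointRCNN.
try:
    import torch
    import torch.nn as nn
except ImportError:  # report a missing install instead of crashing
    torch = None


def check_multi_gpu(batch_size=4):
    """Run one forward/backward pass of a small conv net under nn.DataParallel.

    Returns a short status string; any CUDA/cuDNN error is left to propagate.
    """
    if torch is None:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "CUDA is not available"
    n_gpus = torch.cuda.device_count()
    if n_gpus < 2:
        return "only %d GPU visible" % n_gpus

    # Conv2d layers are dispatched to cuDNN when it is enabled.
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=1),
    )
    model = nn.DataParallel(model).cuda()

    # The batch is split along dim 0, so use at least one sample per GPU.
    x = torch.randn(max(batch_size, n_gpus), 3, 16, 16).cuda()
    loss = model(x).mean()
    loss.backward()
    torch.cuda.synchronize()
    return "forward/backward OK on %d GPUs" % n_gpus


if __name__ == "__main__":
    print("cuDNN version:", torch.backends.cudnn.version() if torch else None)
    print(check_multi_gpu())
```

If this script fails with the same `CUDNN_STATUS_NOT_INITIALIZED` error, the problem most likely lies in the CUDA/cuDNN/PyTorch installation rather than in PointRCNN itself. If it passes, the PointRCNN-specific code paths are the next place to look.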
So, did you ever solve this problem?