Could not run inference using PointPillars model
System information (version)
- Ubuntu 16.04
- CUDA 10.0
- PyTorch 1.2.0
- TorchVision 0.4.0
- traveller59/second.pytorch https://github.com/traveller59/second.pytorch/commit/3aba19c9688274f75ebb5e576f65cfe54773c021
- traveller59/spconv https://github.com/traveller59/spconv/commit/6e727bcd17e7d1b72367f664a53f3789f061510e
Detailed description
I trained a model on the KITTI dataset with configs/pointpillars/car/xyres_16.config.
Then I tried to run inference with the trained model, but the following error message was displayed.
(12000, 100, 4)
Traceback (most recent call last):
File "inference.py", line 63, in <module>
pred = net(example)[0]
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/second.pytorch/second/pytorch/models/voxelnet.py", line 363, in forward
preds_dict = self.network_forward(voxels, num_points, coors, batch_size_dev)
File "/root/second.pytorch/second/pytorch/models/voxelnet.py", line 335, in network_forward
preds_dict = self.rpn(spatial_features)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/second.pytorch/second/pytorch/models/rpn.py", line 394, in forward
res = super().forward(x)
File "/root/second.pytorch/second/pytorch/models/rpn.py", line 324, in forward
x = torch.cat(ups, dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 316 and 313 in dimension 2 at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/generic/THCTensorMath.cu:71
I found a similar issue: https://github.com/traveller59/second.pytorch/issues/175
Steps to reproduce
training
$ python create_data.py kitti_data_prep --root_path=/datasets/kitti
$ python ./pytorch/train.py train --config_path=./configs/pointpillars/car/xyres_16.config --model_dir=/root/model/pp
inference
import numpy as np
import matplotlib.pyplot as plt
import pickle
from pathlib import Path
import torch
from google.protobuf import text_format
from second.utils import simplevis
from second.pytorch.train import build_network
from second.protos import pipeline_pb2
from second.utils import config_tool
config_path = "/root/second.pytorch/second/configs/pointpillars/car/xyres_16.config"
config = pipeline_pb2.TrainEvalPipelineConfig()
with open(config_path, "r") as f:
    proto_str = f.read()
    text_format.Merge(proto_str, config)
input_cfg = config.eval_input_reader
model_cfg = config.model.second
config_tool.change_detection_range_v2(model_cfg, [-50, -50, 50, 50])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt_path = "/root/model/pp/voxelnet-9280.tckpt"
net = build_network(model_cfg).to(device).eval()
net.load_state_dict(torch.load(ckpt_path))
target_assigner = net.target_assigner
voxel_generator = net.voxel_generator
grid_size = voxel_generator.grid_size
feature_map_size = grid_size[:2] // config_tool.get_downsample_factor(model_cfg)
feature_map_size = [*feature_map_size, 1][::-1]
anchors = target_assigner.generate_anchors(feature_map_size)["anchors"]
anchors = torch.tensor(anchors, dtype=torch.float32, device=device)
anchors = anchors.view(1, -1, 7)
info_path = input_cfg.dataset.kitti_info_path
root_path = Path(input_cfg.dataset.kitti_root_path)
with open(info_path, 'rb') as f:
    infos = pickle.load(f)
info = infos[564]
v_path = info["point_cloud"]['velodyne_path']
v_path = str(root_path / v_path)
points = np.fromfile(v_path, dtype=np.float32, count=-1).reshape([-1, 4])
res = voxel_generator.generate(points, max_voxels=12000)
voxels, coords, num_points = res['voxels'], res['coordinates'], res['num_points_per_voxel']
print(voxels.shape)
# add batch idx to coords
coords = np.pad(coords, ((0, 0), (1, 0)), mode='constant', constant_values=0)
voxels = torch.tensor(voxels, dtype=torch.float32, device=device)
coords = torch.tensor(coords, dtype=torch.int32, device=device)
num_points = torch.tensor(num_points, dtype=torch.int32, device=device)
example = {
    "anchors": anchors,
    "voxels": voxels,
    "num_points": num_points,
    "coordinates": coords,
}
pred = net(example)[0]
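(simplevis and matplotlib are imported for the BEV visualization I planned to run after the prediction, roughly as below; the simplevis function names and the "box3d_lidar" key follow the repo's simple-inference notebook and are assumptions on my part.)
# Rough visualization sketch, assuming the prediction dict exposes "box3d_lidar" and
# that simplevis provides point_to_vis_bev / draw_box_in_bev as in simple-inference.ipynb.
boxes_lidar = pred["box3d_lidar"].detach().cpu().numpy()
vis_voxel_size = [0.1, 0.1, 0.1]
vis_point_range = [-50, -30, -3, 50, 30, 1]
bev_map = simplevis.point_to_vis_bev(points, vis_voxel_size, vis_point_range)
bev_map = simplevis.draw_box_in_bev(bev_map, vis_point_range, boxes_lidar, [0, 255, 0], 2)
plt.imshow(bev_map)
plt.show()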
I commented out the following code. As a result, there was no error.
config_tool.change_detection_range_v2(model_cfg, [-50, -50, 50, 50])
I'll investigate the cause of this problem.
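If the size mismatch really comes from the changed detection range, one thing worth checking (a rough sketch reusing the config objects from the script above; the proto field names are read from xyres_16.config) is whether the grid implied by the new range is still divisible by the product of the RPN layer strides, since torch.cat(ups, dim=1) in rpn.py needs every upsampled branch to have the same spatial size. With the 0.16 m voxels of xyres_16, [-50, -50, 50, 50] gives 625 cells per axis, which would explain the 316 vs 313 in the traceback, while something like [-49.92, -49.92, 49.92, 49.92] gives 624:
# Sketch: check that the grid implied by a candidate detection range divides evenly
# by the product of the RPN layer strides (field names as in xyres_16.config).
import numpy as np
from second.utils import config_tool

stride_prod = int(np.prod(model_cfg.rpn.layer_strides))  # 2 * 2 * 2 = 8 for xyres_16

for candidate in ([-50, -50, 50, 50], [-49.92, -49.92, 49.92, 49.92]):
    config_tool.change_detection_range_v2(model_cfg, candidate)
    voxel_size = np.array(model_cfg.voxel_generator.voxel_size[:2])
    pc_range = np.array(model_cfg.voxel_generator.point_cloud_range)
    grid = np.round((pc_range[3:5] - pc_range[0:2]) / voxel_size).astype(np.int64)
    print(candidate, grid, grid % stride_prod == 0)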
The problem lies in cudnn. When you use a large batch and a large num_max_voxels, the effective batch size for BatchNorm1d might be too large. One simple way to deal with it is to disable the cudnn backend:
torch.backends.cudnn.enabled = False
But this solution might increase memory usage, since it disables cudnn for other modules such as Conv2d. A better solution is to write a new BatchNorm1d and disable the use of the cudnn backend only for this new BatchNorm1d, even when torch.backends.cudnn.enabled = True.
On my side, I use this version:
import torch
from torch import nn

class BatchNorm1d_NoCUDNN(nn.Module):
    _version = 2
    __constants__ = ['track_running_stats', 'momentum', 'eps', 'weight', 'bias',
                     'running_mean', 'running_var', 'num_batches_tracked']

    def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
                 track_running_stats=True):
        super(BatchNorm1d_NoCUDNN, self).__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        self.affine = affine
        self.track_running_stats = track_running_stats
        if self.affine:
            self.weight = nn.Parameter(torch.Tensor(num_features))
            self.bias = nn.Parameter(torch.Tensor(num_features))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)
        if self.track_running_stats:
            self.register_buffer('running_mean', torch.zeros(num_features))
            self.register_buffer('running_var', torch.ones(num_features))
            self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))
        else:
            self.register_parameter('running_mean', None)
            self.register_parameter('running_var', None)
            self.register_parameter('num_batches_tracked', None)
        self.reset_parameters()

    def reset_running_stats(self):
        if self.track_running_stats:
            self.running_mean.zero_()
            self.running_var.fill_(1)
            self.num_batches_tracked.zero_()

    def reset_parameters(self):
        self.reset_running_stats()
        if self.affine:
            torch.nn.init.uniform_(self.weight)
            torch.nn.init.zeros_(self.bias)

    def _check_input_dim(self, input):
        if input.dim() != 2 and input.dim() != 3:
            raise ValueError('expected 2D or 3D input (got {}D input)'
                             .format(input.dim()))

    def forward(self, input):
        self._check_input_dim(input)
        exponential_average_factor = 0.0
        if self.training and self.track_running_stats:
            # TODO: if statement only here to tell the jit to skip emitting this when it is None
            if self.num_batches_tracked is not None:
                self.num_batches_tracked += 1
                if self.momentum is None:  # use cumulative moving average
                    exponential_average_factor = 1.0 / float(self.num_batches_tracked)
                else:  # use exponential moving average
                    exponential_average_factor = self.momentum
        if self.training:
            size = input.size()
            # XXX: JIT script does not support the reduce from functools, and mul op is a
            # builtin, which cannot be used as a value to a func yet, so rewrite this size
            # check to a simple equivalent for loop
            #
            # TODO: make use of reduce like below when JIT is ready with the missing features:
            # from operator import mul
            # from functools import reduce
            #
            #   if reduce(mul, size[2:], size[0]) == 1
            size_prods = size[0]
            for i in range(len(size) - 2):
                size_prods *= size[i + 2]
            if size_prods == 1:
                raise ValueError('Expected more than 1 value per channel when training, '
                                 'got input size {}'.format(size))
        # The last argument (cudnn_enabled) is forced to False for float32 inputs, so
        # cudnn is never used for this layer regardless of torch.backends.cudnn.enabled.
        return torch.batch_norm(
            input, self.weight, self.bias, self.running_mean, self.running_var,
            self.training or not self.track_running_stats,
            exponential_average_factor, self.eps,
            False or input.dtype == torch.float16)

    def extra_repr(self):
        return '{num_features}, eps={eps}, momentum={momentum}, affine={affine}, ' \
               'track_running_stats={track_running_stats}'.format(**self.__dict__)

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        version = local_metadata.get('version', None)
        if (version is None or version < 2) and self.track_running_stats:
            # at version 2: added num_batches_tracked buffer
            #   this should have a default value of 0
            num_batches_tracked_key = prefix + 'num_batches_tracked'
            if num_batches_tracked_key not in state_dict:
                state_dict[num_batches_tracked_key] = torch.tensor(0, dtype=torch.long)
        super(BatchNorm1d_NoCUDNN, self)._load_from_state_dict(
            state_dict, prefix, local_metadata, strict,
            missing_keys, unexpected_keys, error_msgs)
The only difference from the original PyTorch BatchNorm1d is the last argument of torch.batch_norm(input, self.weight, self.bias, self.running_mean, self.running_var, self.training or not self.track_running_stats, exponential_average_factor, self.eps, False or input.dtype == torch.float16), where the original PyTorch code passes torch.backends.cudnn.enabled.
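To actually use it, one option (just a sketch on my side; replace_batchnorm1d is a hypothetical helper, not part of second.pytorch) is to recursively swap every nn.BatchNorm1d in the built network for this class and copy over the trained statistics:
# Hypothetical helper (not part of second.pytorch): swap every nn.BatchNorm1d in the
# network for BatchNorm1d_NoCUDNN and copy its trained parameters and running stats.
import torch.nn as nn

def replace_batchnorm1d(module):
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm1d):
            new_bn = BatchNorm1d_NoCUDNN(
                child.num_features, eps=child.eps, momentum=child.momentum,
                affine=child.affine, track_running_stats=child.track_running_stats)
            new_bn.load_state_dict(child.state_dict())
            setattr(module, name, new_bn)
        else:
            replace_batchnorm1d(child)

# Call this after net.load_state_dict(...) so the copied statistics are the trained ones:
# replace_batchnorm1d(net)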
@nydragon Thank you for your reply. I'll try it.
The problem lies in cudnn. When you use a large batch and a large num_max_voxels, the effective batch size for BatchNorm1d might be too large. One simple way to deal with it is to disable the cudnn backend:
torch.backends.cudnn.enabled = False
But this solution might increase memory usage, since it disables cudnn for other modules such as Conv2d. A better solution is to write a new BatchNorm1d and disable the use of the cudnn backend only for this new BatchNorm1d, even when torch.backends.cudnn.enabled = True.
This does not work for me. I added torch.backends.cudnn.enabled = False at the top of the inference code.
I commented out the following code. As a result, there was no error.
config_tool.change_detection_range_v2(model_cfg, [-50, -50, 50, 50])
I'll investigate the cause of this problem.
This works for me.
Hey, I'm having this same issue now. I can't disable cudnn because it uses up all the RAM. Commenting out change_detection_range_v2 allows it to run, but I still need to run on the full 360° point cloud.
@atinfinity Hi, thanks for sharing. I also want to run inference with PointPillars, but I find there is no all-classes config file for PointPillars, while SECOND has all_fhd.config. Do you know a way to do multi-class inference with PointPillars?