second.pytorch

Could not run inference using PointPillars model

Open atinfinity opened this issue 5 years ago • 7 comments

System information (version)

  • Ubuntu 16.04
  • CUDA 10.0
  • PyTorch 1.2.0
  • TorchVision 0.4.0
  • traveller59/second.pytorch https://github.com/traveller59/second.pytorch/commit/3aba19c9688274f75ebb5e576f65cfe54773c021
  • traveller59/spconv https://github.com/traveller59/spconv/commit/6e727bcd17e7d1b72367f664a53f3789f061510e

Detailed description

I trained a model on the KITTI dataset with configs/pointpillars/car/xyres_16.config and then tried to run inference with the trained model.
However, the following error message was displayed.

(12000, 100, 4)
Traceback (most recent call last):
  File "inference.py", line 63, in <module>
    pred = net(example)[0]
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/second.pytorch/second/pytorch/models/voxelnet.py", line 363, in forward
    preds_dict = self.network_forward(voxels, num_points, coors, batch_size_dev)
  File "/root/second.pytorch/second/pytorch/models/voxelnet.py", line 335, in network_forward
    preds_dict = self.rpn(spatial_features)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/second.pytorch/second/pytorch/models/rpn.py", line 394, in forward
    res = super().forward(x)
  File "/root/second.pytorch/second/pytorch/models/rpn.py", line 324, in forward
    x = torch.cat(ups, dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 316 and 313 in dimension 2 at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/generic/THCTensorMath.cu:71

I found a similar issue: https://github.com/traveller59/second.pytorch/issues/175

Steps to reproduce

training

$ python create_data.py kitti_data_prep --root_path=/datasets/kitti
$ python ./pytorch/train.py train --config_path=./configs/pointpillars/car/xyres_16.config --model_dir=/root/model/pp

inference

import numpy as np
import matplotlib.pyplot as plt
import pickle
from pathlib import Path

import torch
from google.protobuf import text_format
from second.utils import simplevis
from second.pytorch.train import build_network
from second.protos import pipeline_pb2
from second.utils import config_tool

config_path = "/root/second.pytorch/second/configs/pointpillars/car/xyres_16.config"
config = pipeline_pb2.TrainEvalPipelineConfig()
with open(config_path, "r") as f:
    proto_str = f.read()
    text_format.Merge(proto_str, config)
input_cfg = config.eval_input_reader
model_cfg = config.model.second
config_tool.change_detection_range_v2(model_cfg, [-50, -50, 50, 50])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ckpt_path = "/root/model/pp/voxelnet-9280.tckpt"
net = build_network(model_cfg).to(device).eval()
net.load_state_dict(torch.load(ckpt_path))
target_assigner = net.target_assigner
voxel_generator = net.voxel_generator

grid_size = voxel_generator.grid_size
feature_map_size = grid_size[:2] // config_tool.get_downsample_factor(model_cfg)
feature_map_size = [*feature_map_size, 1][::-1]

anchors = target_assigner.generate_anchors(feature_map_size)["anchors"]
anchors = torch.tensor(anchors, dtype=torch.float32, device=device)
anchors = anchors.view(1, -1, 7)

info_path = input_cfg.dataset.kitti_info_path
root_path = Path(input_cfg.dataset.kitti_root_path)
with open(info_path, 'rb') as f:
    infos = pickle.load(f)

info = infos[564]
v_path = info["point_cloud"]['velodyne_path']
v_path = str(root_path / v_path)
points = np.fromfile(v_path, dtype=np.float32, count=-1).reshape([-1, 4])
res = voxel_generator.generate(points, max_voxels=12000)
voxels, coords, num_points = res['voxels'], res['coordinates'], res['num_points_per_voxel']

print(voxels.shape)
# add batch idx to coords
coords = np.pad(coords, ((0, 0), (1, 0)), mode='constant', constant_values=0)
voxels = torch.tensor(voxels, dtype=torch.float32, device=device)
coords = torch.tensor(coords, dtype=torch.int32, device=device)
num_points = torch.tensor(num_points, dtype=torch.int32, device=device)

example = {
    "anchors": anchors,
    "voxels": voxels,
    "num_points": num_points,
    "coordinates": coords,
}

pred = net(example)[0]

atinfinity · Oct 22 '19

I commented out the following code. As a result, there was no error.

config_tool.change_detection_range_v2(model_cfg, [-50, -50, 50, 50])

I'll investigate the cause of this problem.
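
For what it's worth, the sizes in the traceback (313 and 316 in dimension 2) are consistent with a grid that is not divisible by the RPN strides: with the 100 m x 100 m range and 0.16 m voxels the pseudo-image is 625 cells wide, and three stride-2 (kernel 3, padding 1) blocks followed by x1/x2/x4 upsampling give 313, 314, and 316. A hedged sanity check for a custom range, reusing voxel_generator and config_tool from the script above (assuming the strides live at model_cfg.rpn.layer_strides in your checkout):

# Hypothetical check: the grid implied by the detection range should stay
# divisible by the product of the RPN layer strides, otherwise the upsampled
# branches concatenated in rpn.py can differ in spatial size.
import numpy as np

total_stride = int(np.prod(model_cfg.rpn.layer_strides))  # e.g. 2 * 2 * 2 = 8
grid_size = voxel_generator.grid_size                     # e.g. [625, 625, 1] for the range above
if any(int(s) % total_stride != 0 for s in grid_size[:2]):
    print("grid size {} is not divisible by the total RPN stride {}".format(
        grid_size[:2], total_stride))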

atinfinity · Oct 24 '19

The problem lies in cudnn. When you use a large batch size and a large num_max_voxels, the effective batch size for BatchNorm1d might be too large. One simple way to deal with this is to disable the cudnn backend:

torch.backends.cudnn.enabled = False

But this solution might increase memory usage, since it disables cudnn for other modules such as Conv2d. A better solution is to write a new BatchNorm1d and disable the use of the cudnn backend only for this new BatchNorm1d, even while torch.backends.cudnn.enabled = True.
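
A minimal sketch of that second idea, assuming torch.backends.cudnn.flags is available as a context manager in your PyTorch build (the full re-implementation posted below avoids even that dependency):

import torch
import torch.nn as nn

class BatchNorm1dNoCUDNN(nn.BatchNorm1d):
    # Hypothetical wrapper: run the stock BatchNorm1d forward with cudnn
    # disabled only for this call, so Conv2d and other layers keep using cudnn.
    def forward(self, input):
        with torch.backends.cudnn.flags(enabled=False):
            return super().forward(input)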

nywenjing · Oct 26 '19

On my side, I use this version:

import torch
import torch.nn as nn


class BatchNorm1d_NoCUDNN(nn.Module):
    _version = 2
    __constants__ = ['track_running_stats', 'momentum', 'eps', 'weight', 'bias',
                     'running_mean', 'running_var', 'num_batches_tracked']

    def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
                 track_running_stats=True):
        super(BatchNorm1d_NoCUDNN, self).__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        self.affine = affine
        self.track_running_stats = track_running_stats
        if self.affine:
            self.weight = nn.Parameter(torch.Tensor(num_features))
            self.bias = nn.Parameter(torch.Tensor(num_features))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)
        if self.track_running_stats:
            self.register_buffer('running_mean', torch.zeros(num_features))
            self.register_buffer('running_var', torch.ones(num_features))
            self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))
        else:
            self.register_parameter('running_mean', None)
            self.register_parameter('running_var', None)
            self.register_parameter('num_batches_tracked', None)
        self.reset_parameters()

    def reset_running_stats(self):
        if self.track_running_stats:
            self.running_mean.zero_()
            self.running_var.fill_(1)
            self.num_batches_tracked.zero_()

    def reset_parameters(self):
        self.reset_running_stats()
        if self.affine:
            torch.nn.init.uniform_(self.weight)
            torch.nn.init.zeros_(self.bias)

    def _check_input_dim(self, input):
        if input.dim() != 2 and input.dim() != 3:
            raise ValueError('expected 2D or 3D input (got {}D input)'
                             .format(input.dim()))

    def forward(self, input):
        self._check_input_dim(input)

        exponential_average_factor = 0.0

        if self.training and self.track_running_stats:
            # TODO: if statement only here to tell the jit to skip emitting this when it is None
            if self.num_batches_tracked is not None:
                self.num_batches_tracked += 1
                if self.momentum is None:  # use cumulative moving average
                    exponential_average_factor = 1.0 / float(self.num_batches_tracked)
                else:  # use exponential moving average
                    exponential_average_factor = self.momentum
        if self.training:
            size = input.size()
            # XXX: JIT script does not support the reduce from functools, and mul op is a
            # builtin, which cannot be used as a value to a func yet, so rewrite this size
            # check to a simple equivalent for loop
            #
            # TODO: make use of reduce like below when JIT is ready with the missing features:
            # from operator import mul
            # from functools import reduce
            #
            #   if reduce(mul, size[2:], size[0]) == 1
            size_prods = size[0]
            for i in range(len(size) - 2):
                size_prods *= size[i + 2]
            if size_prods == 1:
                raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))

        # Last argument (cudnn_enabled) is hard-coded to False (except for fp16 inputs)
        # instead of torch.backends.cudnn.enabled, so cudnn is never used for this layer.
        return torch.batch_norm(
            input, self.weight, self.bias, self.running_mean, self.running_var,
            self.training or not self.track_running_stats, exponential_average_factor,
            self.eps, False or input.dtype == torch.float16,
        )

    def extra_repr(self):
        return '{num_features}, eps={eps}, momentum={momentum}, affine={affine}, ' \
               'track_running_stats={track_running_stats}'.format(**self.__dict__)

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        version = local_metadata.get('version', None)

        if (version is None or version < 2) and self.track_running_stats:
            # at version 2: added num_batches_tracked buffer
            #               this should have a default value of 0
            num_batches_tracked_key = prefix + 'num_batches_tracked'
            if num_batches_tracked_key not in state_dict:
                state_dict[num_batches_tracked_key] = torch.tensor(0, dtype=torch.long)

        super(BatchNorm1d_NoCUDNN, self)._load_from_state_dict(
            state_dict, prefix, local_metadata, strict,
            missing_keys, unexpected_keys, error_msgs)

The only difference from the original PyTorch BatchNorm1d is the last argument of the torch.batch_norm(...) call: the original PyTorch implementation passes torch.backends.cudnn.enabled there, whereas this version passes False or input.dtype == torch.float16, so the cudnn path is skipped for this layer.
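
To actually use it, the nn.BatchNorm1d modules inside the network have to be replaced with this class. A hedged sketch of one way to do that on an already-built net (the swap_batchnorm1d helper is hypothetical, not part of second.pytorch):

import torch.nn as nn

def swap_batchnorm1d(module):
    # Hypothetical helper: recursively replace every nn.BatchNorm1d child with
    # BatchNorm1d_NoCUDNN, copying its parameters and running statistics.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm1d):
            new_bn = BatchNorm1d_NoCUDNN(child.num_features, eps=child.eps,
                                         momentum=child.momentum, affine=child.affine,
                                         track_running_stats=child.track_running_stats)
            new_bn.load_state_dict(child.state_dict())
            setattr(module, name, new_bn)
        else:
            swap_batchnorm1d(child)

# e.g. in the inference script, after net.load_state_dict(torch.load(ckpt_path)):
# swap_batchnorm1d(net)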

nywenjing · Oct 26 '19

@nywenjing Thank you for your reply. I'll try it.

atinfinity · Oct 27 '19

The problem lies in cudnn. When you use a large batch size and a large num_max_voxels, the effective batch size for BatchNorm1d might be too large. One simple way to deal with this is to disable the cudnn backend:

torch.backends.cudnn.enabled = False

But this solution might increase memory usage, since it disables cudnn for other modules such as Conv2d. A better solution is to write a new BatchNorm1d and disable the use of the cudnn backend only for this new BatchNorm1d, even while torch.backends.cudnn.enabled = True.

This does not work for me. I added torch.backends.cudnn.enabled = False at the top of the inference code.

I commented out the following code. As a result, there was no error.

config_tool.change_detection_range_v2(model_cfg, [-50, -50, 50, 50])

I'll investigate the cause of this problem.

This works for me.

sdu2011 · May 15 '20

Hey, I'm having this same issue now. I can't disable cudnn because it uses up all the RAM. Commenting out change_detection_range_v2 allows it to run, but I still need to run on the full 360° point cloud.
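
If the mismatch does come from the grid size (see the divisibility check earlier in this thread), one hedged workaround is to keep the full 360° coverage but round the requested range so the number of cells stays a multiple of the total RPN stride. A sketch, reusing model_cfg and config_tool from atinfinity's script and assuming 0.16 m voxels as in xyres_16.config:

import numpy as np

# Hypothetical range adjustment: shrink the requested +/-50 m square slightly so
# the grid stays divisible by the product of the RPN layer strides (here 2*2*2 = 8).
voxel_size = 0.16                            # xy voxel size from xyres_16.config
total_stride = 8                             # product of rpn.layer_strides [2, 2, 2]
step = voxel_size * total_stride             # 1.28 m per divisible block of cells
half_extent = np.floor(50.0 / step) * step   # 49.92 m: largest multiple of step below 50 m
config_tool.change_detection_range_v2(model_cfg, [-half_extent, -half_extent,
                                                  half_extent, half_extent])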

hulsmeier · Aug 17 '20

@atinfinity Hi, thanks for sharing. I also want to run inference with PointPillars, but I find there is no all_classes_train config for PointPillars, whereas SECOND has all_fhd.config. Do you know a way to do multi-class inference with PointPillars?

ryontang · Jan 14 '21