
Help with inference/training for own test data

xavidzo opened this issue 4 years ago • 63 comments

Hi, I think the work you've done is really amazing, congrats!! For my thesis project, I would like to use your network to analyze point cloud data, so right now what I have is only a series of .pcd files that are recordings of vehicles on a highway. A lidar sensor was mounted on a bridge, so the recordings are from a static point of view. Could you please give some hints on how to prepare this data as input to your network to perform 3D object detection? Do I have to prepare database tables in .json format according to the nuScenes specifications? Is that a must? What other steps are required?

Thank you in advance

xavidzo avatar Dec 05 '20 23:12 xavidzo

Hi, thanks for your interest.

I will have a large update to the codebase in the coming weeks, and I will make sure to add some code for inference on your own data in the update.

tianweiy avatar Dec 06 '20 07:12 tianweiy

Hello, there is a Python file, tools/single_inference.py, but I do not understand the code in it:

 sub_lidar_topic = [ "/velodyne_points", 
                    "/top/rslidar_points",
                    "/points_raw", 
                    "/lidar_protector/merged_cloud", 
                    "/merged_cloud",
                    "/lidar_top", 
                    "/roi_pclouds"]

sub_ = rospy.Subscriber(sub_lidar_topic[5], PointCloud2, rslidar_callback, queue_size=1, buff_size=2**24) 

What does "/roi_pclouds" mean? Is there any tool for transferring the topic "/velodyne_points" to "/roi_pclouds", or can I use "/velodyne_points" directly?

huster-wugj avatar Dec 10 '20 02:12 huster-wugj

@huster-wugj This is a community contribution; you can find details here: https://github.com/tianweiy/CenterPoint/pull/11. You can't directly use /velodyne_points currently. I will add a script to support this later.

tianweiy avatar Dec 10 '20 05:12 tianweiy

This file represents a ROS (Robot Operating System) node for inference on point clouds in ROS. "/roi_pclouds" is the name of the topic that you need to subscribe to in order to receive a point cloud in ROS. Please note that this code is not intended for inference from .bin or .pcd files!

YoushaaMurhij avatar Dec 10 '20 17:12 YoushaaMurhij

You can use something like this for .bin files:

    @staticmethod
    def load_cloud_from_nuscenes_file(pc_f):
        logging.info('loading cloud from: {}'.format(pc_f))
        num_features = 5
        cloud = np.fromfile(pc_f, dtype=np.float32, count=-1).reshape([-1, num_features])
        # the last dimension is the timestamp; zero it out for single-sweep inference
        cloud[:, 4] = 0
        return cloud

    @staticmethod
    def load_cloud_from_deecamp_file(pc_f):
        logging.info('loading cloud from: {}'.format(pc_f))
        num_features = 4
        cloud = np.fromfile(pc_f, dtype=np.float32, count=-1).reshape([-1, num_features])
        # append a zero timestamp column as the last dimension
        cloud = np.hstack((cloud, np.zeros([cloud.shape[0], 1])))
        return cloud

    def predict_on_local_file(self, cloud_file, i):

        # load sample from file
        #self.points = self.load_cloud_from_nuscenes_file(cloud_file)
        #self.points = self.load_cloud_from_deecamp_file(cloud_file)
        self.points = self.load_cloud_from_my_file(cloud_file)

        # prepare input
        voxels, coords, num_points = self.voxel_generator.generate(self.points)
        num_voxels = np.array([voxels.shape[0]], dtype=np.int64)
        grid_size = self.voxel_generator.grid_size
        coords = np.pad(coords, ((0, 0), (1, 0)), mode='constant', constant_values=0)

        voxels = torch.tensor(voxels, dtype=torch.float32, device=self.device)
        coords = torch.tensor(coords, dtype=torch.int32, device=self.device)
        num_points = torch.tensor(num_points, dtype=torch.int32, device=self.device)
        num_voxels = torch.tensor(num_voxels, dtype=torch.int32, device=self.device)

        self.inputs = dict(
            voxels=voxels,
            num_points=num_points,
            num_voxels=num_voxels,
            coordinates=coords,
            shape=[grid_size]
        )

        # predict
        torch.cuda.synchronize()
        tic = time.time()
        with torch.no_grad():
            outputs = self.net(self.inputs, return_loss=False)[0]

        torch.cuda.synchronize()
        logging.info("Prediction time: {:.3f} s".format(time.time() - tic))

        for k, v in outputs.items():
            if k not in [
                "metadata",
            ]:
                outputs[k] = v.to('cpu')

        # visualization
        #print(outputs)
        visual_detection(i, np.transpose(self.points), outputs, conf_th=0.5, show_plot=False, show_3D=False)
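
For .pcd files (as in the original question), a similar loader could be written with open3d. This is only a sketch, assuming open3d is installed; since open3d does not return intensity, the intensity and timestamp channels are simply zero-filled here:

    def load_cloud_from_pcd_file(pc_f):
        # sketch: read a .pcd with open3d and pad to the 5 features
        # (x, y, z, intensity, timestamp) expected by the nuScenes-style pipeline
        import numpy as np
        import open3d as o3d
        pcd = o3d.io.read_point_cloud(pc_f)
        xyz = np.asarray(pcd.points, dtype=np.float32)        # N x 3
        pad = np.zeros((xyz.shape[0], 2), dtype=np.float32)   # intensity and timestamp set to 0
        return np.hstack((xyz, pad))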

YoushaaMurhij avatar Dec 10 '20 17:12 YoushaaMurhij

It's just a ROS topic name; you can change it if you have your own rosbag data.
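
For example, to consume a standard Velodyne driver topic you would only change the name passed to the subscriber (assuming your callback can handle the fields your driver publishes):

    # same call as in the snippet above, just with the standard Velodyne driver topic name
    sub_ = rospy.Subscriber("/velodyne_points", PointCloud2, rslidar_callback,
                            queue_size=1, buff_size=2**24)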

muzi2045 avatar Dec 17 '20 11:12 muzi2045

I guess the problem is solved. Reopen if you still have some issues

tianweiy avatar Jan 09 '21 20:01 tianweiy

Nevertheless, for me it is still unclear how I can use CenterPoint for training on my own custom dataset. Can you guys provide more detailed steps, please? For training, do I have to label data in the same format as nuScenes or Waymo? Should I adapt the "def create_groundtruth_database" function from det3d.datasets.utils.create_gt_database and define a new function inside "tools/create_data.py" similar to "def nuscenes_data_prep"?

Why do you generate pickle files?

In https://github.com/tianweiy/CenterPoint/blob/master/docs/NUSC.md you mentioned for training the "tools/dist_test.py" should be used... Then what's the purpose of the "tools/train.py" script?

Excuse me, perhaps my Python skills are not advanced; that's why I cannot see how to apply CenterPoint to my own lidar .pcd files.

(This issue was about inference, and now I mean training. If you think I should open a new issue, just let me know.)

xavidzo avatar Jan 11 '21 16:01 xavidzo

For training do I have to label data in the same format as nuscenes or waymo?

I think both of these are difficult. I think the easiest way is to look at my Waymo dataset class and write a wrapper like it (please refer to the waymo.py and loading.py files). You want to have the lidar data and then also label the objects in the lidar frame with 7-dim boxes and an object class. If you also want to do temporal stuff (a multi-sweep model), this will be hard and needs some sensor calibration which I am not familiar with.

Should I adapt the "def create_groundtruth_database" function from det3d.datasets.utils.create_gt_database and define a new function inside "tools/create_data.py" similar to "def nuscenes_data_prep"?

Yes..

Why do you generate pickle files?

To save the calibration/file paths/labels all into a single file for efficient accessing.
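
As an illustration only (the field names below are hypothetical, not the exact keys CenterPoint's loaders expect), an info file for a custom dataset can be as simple as one dict per frame dumped with pickle:

    import pickle

    def build_infos(frames, out_path="infos_train.pkl"):
        # "frames" is a hypothetical list of dicts built from your own .pcd/.json pairs;
        # the keys below are illustrative placeholders
        infos = []
        for frame in frames:
            infos.append({
                "lidar_path": frame["pcd_path"],   # path to the point cloud file
                "token": frame["name"],            # unique frame id
                "gt_boxes": frame["boxes"],        # N x 7 array: x, y, z, dims, yaw
                "gt_names": frame["classes"],      # N class strings
                "sweeps": [],                      # empty if you only use single frames
            })
        with open(out_path, "wb") as f:
            pickle.dump(infos, f)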

In https://github.com/tianweiy/CenterPoint/blob/master/docs/NUSC.md you mentioned for training the "tools/dist_test.py" should be used... Then what's the purpose of the "tools/train.py" script?

dist_test.py is for testing, train.py is for training. Where did I mention that dist_test is for training?

Let me know if you have further questions.

Best

tianweiy avatar Jan 13 '21 00:01 tianweiy

  1. Find a lidar labeling tool and define your own label format (JSON or txt).
  2. Collect the raw sensor data (lidar, camera, IMU, GPS, etc.); you must do timestamp synchronization in hardware for later use.
  3. Preprocess the raw sensor data and split the lidar data frame by frame into .bin (KITTI), .pcd, or another point cloud format.
  4. Annotate the data to get your own dataset.
  5. Write the correct dataset API wrapper for your own dataset for whatever training codebase you use (a rough sketch follows below).
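
A very rough sketch of step 5, assuming a plain PyTorch-style wrapper around an info .pkl like the one described earlier; the det3d dataset classes (see waymo.py) add a preprocessing pipeline on top of this, so treat the names here as illustrative only:

    import pickle
    import numpy as np
    from torch.utils.data import Dataset

    class MyHighwayDataset(Dataset):  # hypothetical name, not part of the codebase
        def __init__(self, info_path):
            with open(info_path, "rb") as f:
                self.infos = pickle.load(f)

        def __len__(self):
            return len(self.infos)

        def __getitem__(self, idx):
            info = self.infos[idx]
            # .bin shown here; swap in your own .pcd loader if that is what you store
            points = np.fromfile(info["lidar_path"], dtype=np.float32).reshape(-1, 4)
            return {
                "points": points,
                "gt_boxes": info["gt_boxes"],
                "gt_names": info["gt_names"],
            }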

muzi2045 avatar Jan 15 '21 01:01 muzi2045

OK, thanks a lot for your helpful hints! I have two more questions:

  1. Since my training set is rather small, approx. 1200 lidar frames or less, I guess it makes sense to begin training by loading weights from one of the already available pretrained networks. Do you maybe have any recommendations on hyperparameter settings? Any suggestions for how many epochs I should train given such a small dataset size?
  2. I wish to apply the PointPainting technique for sequential fusion of camera images and lidar, as described here: https://arxiv.org/abs/1911.10150

So basically, after training an image segmentation network and projecting the point cloud points to images in order to find a correspondence between lidar points and pixels, I could figure out the semantic segmentation scores assigned to each lidar point. Then the point cloud dimensionality must be augmented with these scores (4 values: x, y, z, intensity, plus C class scores). I want to do this with CenterPoint. Can you please give me a hint where in the codebase I can change the input size of the PointPillars or VoxelNet backbones to realize this? Are there other adaptations to the codebase I should consider for this PointPainting approach?

xavidzo avatar Jan 15 '21 10:01 xavidzo

  1. It depends on your scene and the classes you want to detect. 1200 frames is enough to train a toy model for testing, but it's really not enough for production deployment (10W+, i.e. on the order of 100k frames).
  2. About PointPainting: you can try to modify the preprocessing to attach a segmentation score to every point and change the VFE layer's input channels. CenterPoint trained with the nuScenes config takes 5-channel points as input (x, y, z, intensity, timestamp); after the feature computation there are 10 channels going into the PFE layer. The easiest way is to attach the segmentation score to the input (x, y, z, intensity, timestamp, class_score), giving 11 channels into the PFE layer; the output will still be MAX_PILLARS_NUM * 64 * MAX_POINTS_PER_PILLAR (a minimal sketch of the painting step follows below).
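
A minimal sketch of that painting step (purely illustrative; `seg_scores` is assumed to be an N x C array of per-point class scores obtained by projecting the points into the segmented camera images):

    import numpy as np

    def paint_points(points, seg_scores):
        # points: N x 5 (x, y, z, intensity, timestamp); seg_scores: N x C class scores.
        # The reader / PFE input channel count in the config must grow by C as well.
        return np.concatenate([points, seg_scores.astype(points.dtype)], axis=1)  # N x (5 + C)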

muzi2045 avatar Jan 16 '21 09:01 muzi2045

Thanks a lot @muzi2045

For 1, I don't have much experience.

For 2, actually, https://github.com/tianweiy/CenterPoint/blob/fc9c9eef5534486c8a3b9e35c4682822bddc3498/det3d/datasets/pipelines/loading.py#L23 now has a painted option. I guess you will first paint the points with a segmentation mask, and then you can directly use these painted point clouds (you only need to change the input_num_features in the configs to the corresponding channel number).
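
For reference, the channel bookkeeping in the config could look roughly like the hypothetical fragment below; the exact key name and the surrounding fields should be checked against your own config file, and the numbers assume 10 painted classes on top of the 5 nuScenes point features:

    # hypothetical fragment of a nuScenes PointPillars config after painting with 10 classes
    num_painted_classes = 10
    reader = dict(
        type="PillarFeatureNet",
        num_filters=[64, 64],
        num_input_features=5 + num_painted_classes,  # x, y, z, intensity, t + class scores
    )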

I implemented PointPainting for nuScenes in the past. You will need the calibration matrices for camera/lidar and good synchronization, which I feel is not trivial. You can get some idea from the following code, which uses an mmdetection3d nuImages model. It is quite slow though, so you will need to do some optimization.

import argparse
from nuscenes.utils.geometry_utils import view_points
import numpy as np 
import numba 
import pickle 
from tqdm import tqdm 
from mmdet.apis import inference_detector, init_detector
import torch 
import time 
import mmcv 

import platform
from functools import partial
import os 
from det3d.torchie.parallel import collate, collate_kitti
from det3d.torchie.trainer import get_dist_info
from torch.utils.data import DataLoader

from det3d.datasets.loader.sampler import (
    DistributedGroupSampler,
    DistributedSampler,
    DistributedSamplerV2,
    GroupSampler,
)
from collections import defaultdict 
from mmdet.datasets.pipelines import LoadImageFromFile
from det3d import torchie

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource

    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))


FILE_CLIENT = mmcv.FileClient(backend='disk')

def load_image(filename):
    global FILE_CLIENT
    img_bytes = FILE_CLIENT.get(filename)
    img = mmcv.imfrombytes(img_bytes, flag='color')
    img = img.astype(np.float32)

    return img 

def simple_collate(batch_list):
    ret_dict = defaultdict(list)
    for i in range(len(batch_list)):
        ret_dict['lidars'].append(batch_list[i]['lidars'])
        ret_dict['imgs'].append(batch_list[i]['imgs'])
        ret_dict['sweeps'].append(batch_list[i]['sweeps'])

    return ret_dict 

def build_dataloader(
    dataset, batch_size, workers_per_gpu, num_gpus=1, dist=True, **kwargs
):
    
    data_loader = DataLoader(
        dataset,
        batch_size=1,
        sampler=None,
        shuffle=False,
        num_workers=4,
        collate_fn=simple_collate,
        pin_memory=False,
    )

    return data_loader 

def get_time():
    torch.cuda.synchronize()
    return time.time()

def read_file(path, tries=2, num_point_feature=5):
    points = None
    try_cnt = 0
    points = np.fromfile(path, dtype=np.float32).reshape(-1, 5)[:, :num_point_feature]

    return points


def get_obj(path):
    with open(path, 'rb') as f:
            obj = pickle.load(f)
    return obj 

def to_tensor(x, device='cuda:0'):
    return torch.tensor(x, dtype=torch.float32).to(device)

def view_points(points, view, normalize, device='cuda:0'):
    """
    This is a helper class that maps 3d points to a 2d plane. It can be used to implement both perspective and
    orthographic projections. It first applies the dot product between the points and the view. By convention,
    the view should be such that the data is projected onto the first 2 axis. It then optionally applies a
    normalization along the third dimension.

    For a perspective projection the view should be a 3x3 camera matrix, and normalize=True
    For an orthographic projection with translation the view is a 3x4 matrix and normalize=False
    For an orthographic projection without translation the view is a 3x3 matrix (optionally 3x4 with last columns
     all zeros) and normalize=False

    :param points: <np.float32: 3, n> Matrix of points, where each point (x, y, z) is along each column.
    :param view: <np.float32: n, n>. Defines an arbitrary projection (n <= 4).
        The projection should be such that the corners are projected onto the first 2 axis.
    :param normalize: Whether to normalize the remaining coordinate (along the third axis).
    :return: <np.float32: 3, n>. Mapped point. If normalize=False, the third coordinate is the height.
    """

    assert view.shape[0] <= 4
    assert view.shape[1] <= 4
    assert points.shape[0] == 3

    viewpad = torch.eye(4).to(device)
    viewpad[:view.shape[0], :view.shape[1]] = view

    nbr_points = points.shape[1]

    # Do operation in homogenous coordinates.
    points = torch.cat((points, torch.ones([1, nbr_points]).to(device)), dim=0)

    points = torch.matmul(viewpad, points)

    points = points[:3, :]

    if normalize:
        points = points / points[2:3, :].repeat(3, 1).reshape(3, nbr_points)

    return points

@torch.no_grad()
def paint(points, segmentations, all_cams_from_lidar, all_cams_intrinsic, num_cls=10, device='cuda:0'):
    num_lidar_point = points.shape[0]
    num_camera = len(all_cams_from_lidar)
    H, W, num_cls = segmentations[0].shape 

    # we keep a final mask to keep track all points that can project into an image
    # we set a point's rgb value to (0, 0, 0) if no projection is possible 
    # not quite sure how to deal with points that appear in multiple cameras. Now we include all of them and take the average 

    final_mask = torch.zeros([num_lidar_point], dtype=torch.bool).to(device)
    segmentation_map = torch.zeros([num_lidar_point, num_cls], dtype=torch.float32).to(device) 
    num_occurence = torch.zeros([num_lidar_point], dtype=torch.float32).to(device)  # keep track of how many pixels a point is projected into 
    points = to_tensor(points)


    for i in range(num_camera):
        tm = to_tensor(all_cams_from_lidar[i])
        intrinsic = to_tensor(all_cams_intrinsic[i])
        segmentation = to_tensor(segmentations[i])

        # transfer to camera frame 
        transform_points  = points.clone().transpose(1, 0)
        transform_points[:3, :] = torch.matmul(tm, torch.cat([
                transform_points[:3, :], 
                torch.ones(1, num_lidar_point, dtype=torch.float32, device=device)
            ], dim=0)
        )[:3, :]

        depths = transform_points[2, :]

        # take 2d image of the lidar point 
        points_2d = view_points(transform_points[:3, :], intrinsic, normalize=True).transpose(1, 0)[:, :2]

        points_seg = torch.zeros([num_lidar_point, num_cls], dtype=points.dtype, device=device)

        # N x 2 
        points_2d = torch.floor(points_2d)
        assert H == 900 and W == 1600  # nuScenes camera resolution
        
        points_x, points_y = points_2d[:, 0].long(), points_2d[:, 1].long()

        valid_mask = (points_x > 1) & (points_x < W-1) & (points_y >1) & (points_y < H-1)
        points_seg[valid_mask] = segmentation[points_y[valid_mask], points_x[valid_mask]]

        # ensure the points are in front of the camera
        mask = valid_mask & (depths > 0)
        
        segmentation_map[mask] += points_seg[mask]
        final_mask = mask | final_mask
        num_occurence += mask.float() 

    # for points that have at least one projection, we compute the average segmentation score
    projection_points = points[final_mask]
    segmentation_map = segmentation_map[final_mask]  / num_occurence[final_mask].unsqueeze(-1)  
    projection_points = torch.cat([projection_points, segmentation_map], dim=1)    

    # for points that are not in any images, we assign background class to the point 
    no_projection_points = points[torch.logical_not(final_mask)]
    fake_seg = torch.zeros([no_projection_points.shape[0], num_cls], dtype=torch.float32, device=device) 
    no_projection_points = torch.cat([no_projection_points, fake_seg], dim=1)

    points = torch.cat([projection_points, no_projection_points], dim=0)

    return points

def process_seg_result(result):
    # 10 x NUM_BOX x H x W 
    segs = [] 
    for i in range(len(result)):
        per_cls_seg = result[i]
        if len(per_cls_seg) == 0:
            per_cls_seg = np.zeros((900, 1600)) 
        else:
            per_cls_seg = np.stack(per_cls_seg, axis=0)
            # merge all box's instance mask 
            # if one pixel is in any of the boxes, we will set this pixel's value to 1
            # otherwise set it to zero 

            # N H W 
            per_cls_seg = np.mean(per_cls_seg, axis=0)
            per_cls_seg = (per_cls_seg>0).astype(np.float32)

        segs.append(per_cls_seg)

    # 10 x H x W 
    seg_mask = np.stack(segs, axis=0)

    seg_mask = np.transpose(seg_mask, (1, 2, 0)) # H x W x 10 
    return seg_mask 

from torch.utils.data import Dataset 
class PaintData(Dataset):
    def __init__(
        self,
        info_path,
        start=-1,
        end=-1
    ):
        if end == -1:
            self.infos = get_obj(info_path)[start:]
        elif start == -1:
            self.infos = get_obj(info_path)[:end+1]
        else:
            self.infos = get_obj(info_path)[start:end+1]
        self._set_group_flag()

    def __getitem__(self, index):
        info = self.infos[index]
        sweeps = [info] + info['sweeps']

        lidars = [] 
        imgs = [] 
        assert len(sweeps) == 10 
        for i in range(10):
            sweep = sweeps[i]
            lidar_path = sweep['lidar_path']
            lidar = read_file(lidar_path)          
            lidars.append(lidar)

            all_cams_path = sweep['all_cams_path'] 
            img = [] 
            for path in all_cams_path: 
                img.append(load_image(path))

            # img = np.stack(img, axis=0)
            imgs.append(img)        

        ret_dict = {
            'lidars': lidars, 
            'sweeps': sweeps,
            'imgs': imgs 
        }

        return ret_dict  

    def __len__(self):
        return len(self.infos)

    def _set_group_flag(self):
        """Set flag according to image aspect ratio.
        Images with aspect ratio greater than 1 will be set as group 1,
        otherwise group 0.
        """
        self.flag = np.ones(len(self), dtype=np.uint8)
        # self.flag = np.zeros(len(self), dtype=np.uint8)
        # for i in range(len(self)):
        #     img_info = self.img_infos[i]
        #     if img_info['width'] / img_info['height'] > 1:
        #         self.flag[i] = 1

@torch.no_grad()
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--info_path", type=str)
    parser.add_argument('--config', help='Config file')
    parser.add_argument('--checkpoint', help='Checkpoint file')
    parser.add_argument('--device', default='cuda:0', type=str)
    parser.add_argument('--start', type=int, default=0, )
    parser.add_argument('--end', type=int)
    args = parser.parse_args()

    dataset = PaintData(args.info_path, args.start, args.end)
    model = init_detector(args.config, args.checkpoint, device=args.device)

    loader = build_dataloader(dataset, 1, 4, dist=False)
    prog_bar = torchie.ProgressBar(len(dataset))

    for _, data_batch in enumerate(loader):
        lidars, sweeps, imgs = data_batch['lidars'], data_batch['sweeps'], data_batch['imgs']

        lidars = lidars[0]
        sweeps = sweeps[0]
        imgs = imgs[0]

        for i in range(10):
            sweep = sweeps[i]
            lidar_path = sweep['lidar_path']
            lidar = lidars[i]
            img = imgs[i]

            all_cams_from_lidar = sweep['all_cams_from_lidar']
            all_cams_intrinsic = sweep['all_cams_intrinsic']
            all_cams_path = sweep['all_cams_path'] 
            instance_seg_results = [] 

            for j in range(6):
                result = inference_detector(model, img[j])[1]
                seg_mask = process_seg_result(result)
                instance_seg_results.append(seg_mask)

            # run painting 
            painted_points = paint(lidar, instance_seg_results, all_cams_from_lidar, all_cams_intrinsic)

            dir_path = os.path.join(*lidar_path.split('/')[:-2], 'stronger_painted_'+lidar_path.split('/')[-2])

            if not os.path.isdir(dir_path):
                os.mkdir(dir_path)

            painted_path = os.path.join(dir_path, lidar_path.split('/')[-1])      

            np.save(painted_path, painted_points.cpu().numpy())

        prog_bar.update()

if __name__ == '__main__':
    main()

tianweiy avatar Jan 17 '21 02:01 tianweiy

https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/pipelines/loading.py#L1

tianweiy avatar Feb 19 '21 15:02 tianweiy

Please correct me if I am wrong: a file like "infos_train_01sweeps_filter_zero_gt.pkl" saves the aggregated ground-truth information for all training frames at once, right? What's the meaning of "filter_zero" in the filename? What's the difference between "dbinfos_train_1sweeps_withvelo.pkl" and "infos_train_01sweeps_filter_zero_gt.pkl"? Is it mandatory to create these files for a simple custom dataset?

In my case, I have all .pcd files in one folder, "pointclouds", and in another folder, "annotations", I have the corresponding .json files with the labeled vehicles for each .pcd / frame. One such .json file looks, for example, as follows:

{"name":"000000","timestamp":0,"index":0,"labels":[{"id":0,"category":"car","box3d":{"dimension":{"width":3.52,"length":1.9000000000000001,"height":1.58},"location":{"x":13.990062100364485,"y":17.122656041496878,"z":-6.110000000000003},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}},{"id":1,"category":"car","box3d":{"dimension":{"width":3.41,"length":1.68,"height":1.58},"location":{"x":27.212441928732442,"y":-2.479418392114875,"z":-6.110000000000001},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}}

So my guess is that I should create one "info_train.pkl" that contains at least the aggregated data of all .json files, plus the file paths to all .pcd files. Do I need to store more information in this .pkl? I think I do not need to worry about frame transformations of any kind. My scenario is different from the ego-vehicle case, since the lidar sensor was mounted on a bridge and remained static, pointing at the highway. The global frame is in my case also the lidar frame. In the next image, a recording sequence from the Ouster OS1-64 is shown.

I checked nuScenes and Waymo; their label structure is very complex, and they include far more attributes than I have in the .json labels for my custom data. I see you include "num_points_in_gt" in the info .pkl (I assume that means "number of points inside each ground-truth bounding box"). I don't have such an attribute in my labels.

Based on my .json example above, will the object data attributes I collected be enough to perform training, or am I missing something else?

xavidzo avatar Feb 19 '21 18:02 xavidzo

What's the meaning of the "filter_zero" in the filename?

It filters out boxes with no lidar points.

What's the difference between "dbinfos_train_1sweeps_withvelo.pkl" and "infos_train_01sweeps_filter_zero_gt.pkl"? Is it mandatory to create these files for a simple custom dataset?

The infos file is for annotations; the dbinfos file is for the ground-truth database. If you want to use ground-truth augmentation, you need to generate dbinfos like we do in create_data.py.

I think the collected data are enough. 'num_points_in_gt' is not needed.

tianweiy avatar Feb 21 '21 19:02 tianweiy

Hi, I have a couple more questions, if you could answer them please:

  1. What is the sweeps attribute useful for? Why is it 10 on nuScenes and 1 or 2 on Waymo? For using sweeps, each lidar frame must have a timestamp, otherwise it cannot be implemented, right?
  2. I assume both nuScenes and Waymo use 5 lidar features, which in the code translates to num_point_features = 5 = x, y, z, intensity, and t (timestamp), correct? But here in the code it seems that at least for nuScenes you only load the first 4, so x, y, z, and intensity:

https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/pipelines/loading.py#L23

  3. Looking at the next piece of code, you transform from the Waymo to the KITTI reference-frame convention because the network backbone expects the input to be in KITTI format? https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/waymo/waymo_common.py#L265

Do you make the same transform for nuScenes as well? I don't see the same operation here: https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/nuscenes/nusc_common.py#L403

  4. I read that in Waymo the bounding boxes' orientation is labeled only with the yaw angle, but when I looked at the info data there are still 3 different values for orientation in the gt_boxes field. Why? To be more specific, in my custom dataset objects are labeled as x, y, z, width (corresponds to the x dimension), length (corresponds to the y dimension), and height (= z, obviously). Do I need to transform something?

xavidzo avatar Feb 25 '21 15:02 xavidzo

What is the sweeps attribute useful for? Why is it 10 on nuscenes and 1 or 2 on waymo? For using sweeps, each lidar frame must have a timestamp, otherwise it cannot be implemented, right?

These are precomputed paths to the lidar frames before the current reference frame; they are used to aggregate multi-frame lidar data. The nuScenes lidar has 32 beams and Waymo's has 64 beams. As nuScenes produces a really sparse point cloud, we use more frames to aggregate data. For Waymo we could also use 5 or 10, but it is slower. And yes, each lidar frame needs a timestamp for this.
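
For intuition, the aggregation itself boils down to something like the sketch below (the dict keys are made up for illustration; in CenterPoint the transforms and time lags come from the precomputed info files):

    import numpy as np

    def aggregate_sweeps(ref_points, sweeps):
        # illustrative only: each sweep dict carries its points (N x 4), a 4x4
        # transform into the reference frame, and a time lag in seconds
        clouds = [np.hstack([ref_points, np.zeros((ref_points.shape[0], 1), np.float32)])]
        for sweep in sweeps:
            pts = sweep["points"].copy()
            xyz1 = np.hstack([pts[:, :3], np.ones((pts.shape[0], 1), np.float32)])
            pts[:, :3] = (xyz1 @ sweep["transform"].T)[:, :3]   # move into the reference frame
            lag = np.full((pts.shape[0], 1), sweep["time_lag"], np.float32)
            clouds.append(np.hstack([pts, lag]))
        return np.concatenate(clouds, axis=0)   # M x 5: x, y, z, intensity, time lag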

I assume both nuscenes and waymo use 5 lidar features, which in the code translates to num_point_features = 5 = x, y, z, intensity and t (timestamp), correct ? But here in the code it seems that at least for nuscenes you only load the first 4, so x, y, z, and intensity:

For nuScenes, we use x, y, z, intensity, and timestamp. For Waymo, we use x, y, z, intensity, and elongation, with an optional timestamp for the two-frame model.

do you make the same transform for nuscenes as well? I don't see the same operation here:

This is already done during the construction of the nuScenes info file.

I read in waymo they labeled the bounding boxes' orientation only with the yaw angle, but when I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why?

I'm not quite sure about 'I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why'. I think there is only one value?

To be more specific, in my custom dataset objects are labeled like x, y, z, width (corresponds to x dimension), length (corresponds to y dimension), and height ( = z obvious). Do I need to transform something?

Please use x, y, z, length (corresponds to y), width (corresponds to x), height (corresponds to z), rotation.

tianweiy avatar Feb 28 '21 03:02 tianweiy

Thanks a lot tianweiy, as always, for taking your time to reply. I still have some questions:

a) "For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model." However, here in the code, it seems like you only select the x, y, and intensity for nuScenes when you read the pcd file in the function read_file() from pipelines/loading.py. because num_point_feature=4 ??:

def read_file(path, tries=2, num_point_feature=4, painted=False):
    if painted:
        //////////////
    else:
        points = np.fromfile(path, dtype=np.float32).reshape(-1, 5)[:, :num_point_feature]

    return points

b) "Not quite sure about 'I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why' I think there is only one value?". For instance, this a sample of Waymo's infos_train_01sweeps_filter_zero_gt.pkl:

[{'path': 'data/Waymo/train/lidar/seq_0_frame_0.pkl', 'anno_path': 'data/Waymo/train/annos/seq_0_frame_0.pkl', 'token': 'seq_0_frame_0.pkl', 'timestamp': 1553629304.2740898, 'sweeps': [], 'gt_boxes': array([[-7.75765467e+00,  9.75379848e+00,  2.81035209e+00,
         5.00253379e-01,  1.01573773e-01,  4.49999988e-01,
        -3.18337395e-03,  3.55915539e-03, -1.53220856e+00],
       [ 2.29385357e+01, -6.12172985e+00,  2.50068998e+00,
         4.72529471e-01,  6.68555871e-02,  5.29999971e-01,
        -1.27970241e-03, -1.58406922e-03, -4.62591171e+00]
        ......................................................
        ......................................................

In the above sample, in the field 'gt_boxes' there are 9 values for each box, not just 7... And here is a sample of Waymo's dbinfos_train_1sweeps_withvelo.pkl:

{'VEHICLE': [{'name': 'VEHICLE', 'path': 'gt_database_1sweeps_withvelo/VEHICLE/0_VEHICLE_2.bin', 'image_idx': 0, 'gt_idx': 2, 'box3d_lidar': array([-5.8496891e+01,  4.4255123e+00,  2.4241050e-01,  1.7933178e+00,
        3.9103422e+00,  1.4900000e+00, -1.5062687e-01, -1.1980739e-02,
        1.4906577e+00], dtype=float32), 'num_points_in_gt': 51, 'difficulty': 0, 'group_id': 0}, {'name': 'VEHICLE', 'path': 'gt_database_1sweeps_withvelo/VEHICLE/0_VEHICLE_3.bin', 'image_idx': 0, 'gt_idx': 3, 'box3d_lidar': array([ 3.7829960e+01, -5.1944214e-01,  1.1195552e+00,  1.8915721e+00,
        4.2592230e+00,  1.8200001e+00,  2.0176919e-02,  3.6506136e-03,
       -1.6147256e+00], dtype=float32), 'num_points_in_gt': 5, 'difficulty': 0, 'group_id': 1}],

In the field 'box3d_lidar' there are also 9 values. I assume the last three values correspond to the angles roll, pitch, and yaw. But in theory Waymo only labeled the yaw angle, so there should be only one value for orientation? ... or at least the other two should be zero, but they are not... hence my question, why?

c) Also, why is your PointPillars-based model on nuScenes faster than the one on Waymo (31 fps vs. 19 fps)?

d) Now I am about to start training on my own dataset. Since I have very few frames to train on, ~150 frames, I want to do transfer learning, meaning I want to reuse the weights of the model trained on nuScenes, so I only want to replace and train the last layer of CenterPoint, i.e. the detection head. I am looking at the file "nusc_centerpoint_pp_02voxel_circle_nms.py":

tasks = [
    dict(num_class=1, class_names=["car"]),
    dict(num_class=2, class_names=["truck", "construction_vehicle"]),
    dict(num_class=2, class_names=["bus", "trailer"]),
    dict(num_class=1, class_names=["barrier"]),
    dict(num_class=2, class_names=["motorcycle", "bicycle"]),
    dict(num_class=2, class_names=["pedestrian", "traffic_cone"]),
]

Can you please explain why you put classes in different dictionaries? I have the intuition that the classes in the same dictionary are similar... In my dataset I have the classes car, truck, van, bus, and pedestrian. Should I put them in different dictionaries as well, or can I make just one dictionary like tasks = [dict(num_class=5, class_names=["car", "truck", "van", "bus", "pedestrian"])]?

Do I need to make more changes in this portion of the config file?

    bbox_head=dict(
        # type='RPNHead',
        type="CenterHead",
        mode="3d",
        in_channels=sum([128, 128, 128]),
        norm_cfg=norm_cfg,
        tasks=tasks,
        dataset='nuscenes',
        weight=0.25,
        code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2, 1.0, 1.0],
        common_heads={'reg': (2, 2), 'height': (1, 2), 'dim':(3, 2), 'rot':(2, 2), 'vel': (2, 2)}, # (output_channel, num_conv)
        encode_rad_error_by_sin=False,
        direction_offset=0.0,
        bn=True
    )

What do these "code_weights" values represent? Are there 10 values because in nuScenes there are 10 classes? So in my case, with 5 classes, should I specify only five "code_weights"?

xavidzo avatar Mar 01 '21 12:03 xavidzo

a) "For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model." However, here in the code, it seems like you only select the x, y, and intensity for nuScenes when you read the pcd file in the function read_file() from pipelines/loading.py. because num_point_feature=4 ??:

Yeah, we concatenate the timestamp later https://github.com/tianweiy/CenterPoint/blob/890e538afc95ee741f4c29b4c30b37251027e31c/det3d/datasets/pipelines/loading.py#L138

In the field 'box3d_lidar' there are also 9 values. I assume the last three values correspond to angles roll, pitch, yaw. But in theory in waymo they only labeled the yaw angle, so there should be only one value for orientation? ... or at least the other two should be zero, but they are not... hence my question, why?

Sorry for the confusion, they are not three angles. They are x, y, z, dx, dy, dz, velocity_x, velocity_y, yaw.
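
So if you ever need plain 7-dim boxes from these 9-dim arrays, you can simply drop the two velocity columns, e.g.:

    import numpy as np

    def drop_velocity(gt_boxes):
        # gt_boxes: N x 9 = (x, y, z, dx, dy, dz, vx, vy, yaw) as stored in the info files
        return np.concatenate([gt_boxes[:, :6], gt_boxes[:, 8:9]], axis=1)  # N x 7, velocities removed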

c) Also, why is your pointpillars-based model on nuscenes faster than the one on waymo (31 fps vs 19 fps) ?

Two reasons: 1. Waymo has a larger scene (150x150 m vs. 100x100 m). 2. The Waymo model's output stride is 1, compared to 4 in nuScenes (Waymo with output stride 4 doesn't perform well).

d) can you please explain why you put classes in different dictionaries? I have the intuition the classes in the same dictionary are similar... In my dataset I have classes car, truck, van, bus, pedestrian.... should I put them in different dictionaries as well, or can I make just one dictionary like tasks = [ dict(num_class=4, class_names=["car", "truck", "van", "bus", "pedestrian"]]

Basically, we assign different objects to different detection heads to solve the class imbalance problem in nuScenes. If your dataset is not class-imbalanced, you can just use a single group like you do above; otherwise, try to follow our group definition, which has proven to work well on nuScenes.
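
For the five classes mentioned above, the two obvious options would look something like the sketch below (illustrative only; pick one depending on how imbalanced your label counts are, and assign the chosen list to `tasks` in the config):

    # option 1: a single head for all five classes (fine if label counts are roughly balanced)
    tasks_single = [
        dict(num_class=5, class_names=["car", "truck", "van", "bus", "pedestrian"]),
    ]

    # option 2: separate heads for classes with very different frequencies/sizes,
    # mirroring the spirit of the nuScenes grouping
    tasks_grouped = [
        dict(num_class=1, class_names=["car"]),
        dict(num_class=3, class_names=["truck", "van", "bus"]),
        dict(num_class=1, class_names=["pedestrian"]),
    ]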

Do I need to make more changes in this portion of the config file?

No need. code_weights contains the weights for the regression losses of 'reg', 'height', 'dim', etc.; it is not related to the classes.

tianweiy avatar Mar 02 '21 07:03 tianweiy

Probably later on I would like to fuse camera images with the results from CenterPoint, following the state-of-the-art method in CLOCs. The code released so far fuses the outputs of the 2D detector Cascade R-CNN with the 3D detector SECOND: https://arxiv.org/abs/2009.00784, https://github.com/pangsu0613/CLOCs. In the README of the above GitHub link, in the section "Fusion of other 3D and 2D detectors", the author says the following:

Step 3: Since the number of detection candidates are different for different 2D/3D detectors, you need to modify the corresponding parameters in the CLOCs code. Then train the CLOCs fusion. For example, there are 70400 (200x176x2) detection candidates in each frame from SECOND with batch size equals to 1. It is a very large number because SECOND is a one-stage detector, for other multi-stage detectors, you could just take the detection candidates before the final NMS function, that would reduce the number of detection candidates to hundreds or thousands.

a) How would this statement apply to CenterPoint? Can you explain, please? I mean, what's the number of detection candidates in each frame? Is this number always fixed / the same for all frames? And how can I take the detection candidates in CenterPoint before NMS is applied?

b) Update: I've been training on my custom data for more than 100 epochs. My dataset is quite small, only 150 frames. This is the performance so far:

    2021-03-07 15:14:11,158 - INFO - Epoch [122/125][35/38] lr: 0.00000, eta: 0:02:21, time: 0.408, data_time: 0.070, transfer_time: 0.004, forward_time: 0.078, loss_parse_time: 0.000 memory: 4080,
    2021-03-07 15:14:11,158 - INFO - task : ['car'], loss: 3.0604, hm_loss: 2.5405, loc_loss: 2.0795, loc_loss_elem: ['0.2258', '0.2351', '0.2916', '0.0908', '0.0972', '0.1771', '1.5755', '0.3259', '0.4193', '0.1623'], num_positive: 11.2000
    2021-03-07 15:14:11,158 - INFO - task : ['trailer'], loss: 0.8950, hm_loss: 0.6023, loc_loss: 1.1709, loc_loss_elem: ['0.1410', '0.1478', '0.1197', '0.1316', '0.0734', '0.0514', '0.6290', '0.2809', '0.2147', '0.1092'], num_positive: 7.6000
    2021-03-07 15:14:11,158 - INFO - task : ['bus'], loss: 1.3275, hm_loss: 1.1275, loc_loss: 0.8001, loc_loss_elem: ['0.1548', '0.1824', '0.0938', '0.0199', '0.0174', '0.0179', '0.0063', '0.0058', '0.2177', '0.0938'], num_positive: 12.8000
    2021-03-07 15:14:11,158 - INFO - task : ['van'], loss: 0.7695, hm_loss: 0.5451, loc_loss: 0.8977, loc_loss_elem: ['0.1176', '0.1482', '0.1664', '0.0785', '0.0616', '0.0487', '0.2156', '0.3060', '0.0561', '0.1162'], num_positive: 5.8000
    2021-03-07 15:14:11,158 - INFO - task : ['pedestrian'], loss: 0.3599, hm_loss: 0.2008, loc_loss: 0.6362, loc_loss_elem: ['0.2162', '0.1056', '0.1254', '0.0568', '0.0615', '0.0198', '0.0024', '0.0028', '0.0198', '0.0301'], num_positive: 8.0000

Do you have any idea why the loss for the car class does not get lower? I've seen it mostly around 3 and 4, but for the other classes it started high and has converged rapidly to values around 1. In my dataset, the car class is the most represented, followed by the trailer class, whose loss does decrease easily... I wonder why this is not the case for the car class.

xavidzo avatar Mar 02 '21 16:03 xavidzo

a) How would this statement apply to CenterPoint, can you explain please? I mean, what's the number of detection candidates in each frame? Is this number always fixed / the same for all frames? and how can I take the number of detection candidates in CenterPoint before NMS is applied?

I have never read this, and sorry, I don't have time to read it at the moment.

b) It seems that one value is off a lot ['1.5755', '0.3259', '0.4193', '0.1623']; unfortunately, I don't have many clues.

tianweiy avatar Mar 09 '21 04:03 tianweiy

I am aware that you are quite busy right now, but if you have some free time after ICCV, please try to answer a) of my previous comment. You don't need to read the whole CLOCs paper, only the README in the github link I posted, which would take you no more than 5 minutes. Thank you in advance tianweiy

xavidzo avatar Mar 09 '21 14:03 xavidzo

For nuScenes, you will use box here https://github.com/tianweiy/CenterPoint/blob/1ecebf980f75cfe7f53cc52032b184192891c9b9/det3d/models/bbox_heads/center_head.py#L481

For Waymo (or if you want to train two stage model), you will use this https://github.com/tianweiy/CenterPoint/blob/1ecebf980f75cfe7f53cc52032b184192891c9b9/det3d/models/detectors/two_stage.py#L149

I am working on adding tutorials and code for easier training on custom datasets. I am also working on accelerating the inference a bit. I will try to merge these two asap but it may take me one week or so.

tianweiy avatar Mar 21 '21 03:03 tianweiy

Check my reply here: https://github.com/tianweiy/CenterPoint/issues/78#issuecomment-812951128, though I believe you already finished something.

tianweiy avatar Apr 04 '21 01:04 tianweiy

Yes, I can show you in this GIF how CenterPoint works after training on my own data recorded from the highway. The learned weights from nuScenes didn't help much, so I trained from scratch. Basically, you can see cars and trucks being detected and visualized in RViz. Thanks a lot for all your support! (GIF: centerpoint-highway-prov)

Two questions, this one maybe off-topic: Is there any reason why you decided to use det3d for your work? I mean, I've seen that many researchers used OpenPCDet instead for their implementations.

About the voxelization: for nuscenes you used 30000 as the maximum number of voxels and 20 as the maximum number of points per voxel. In other projects for kitti I saw 12000 as max_voxel_num and 100 as max_points_in_voxel.... Can you please explain how these values affect the performance of the network? For my own data I chose the same config as nuscenes, but I don't know if I should change it given that my point cloud frames are quite sparse....

xavidzo avatar Apr 04 '21 11:04 xavidzo

Is there any reason why you decided to use det3d for your work? I mean, I've seen that many researchers used OpenPCDet instead for their implementations.

I finished most of the work for CenterPoint before OpenPCDet got Waymo / nuScenes support. I also tried OpenPCDet for a while last summer, but it trains a little slower and the numbers are a little lower than my implementation here, so I just stuck with the current codebase.

tianweiy avatar Apr 04 '21 14:04 tianweiy

About the voxelization: for nuscenes you used 30000 as the maximum number of voxels and 20 as the maximum number of points per voxel. In other projects for kitti I saw 12000 as max_voxel_num and 100 as max_points_in_voxel.... Can you please explain how these values affect the performance of the network? For my own data I chose the same config as nuscenes, but I don't know if I should change it given that my point cloud frames are quite sparse....

I think 100 is probably too large. 30000/12000 are set to cover most of the voxels / pillars. 20/100 are not that important, and a smaller number will be faster.
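
As a concrete reference, these knobs sit in the voxel_generator block of the config; the values below mirror the nuScenes PointPillars setting and are illustrative, not a recommendation for your data:

    voxel_generator = dict(
        range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],  # x_min, y_min, z_min, x_max, y_max, z_max
        voxel_size=[0.2, 0.2, 8],                     # pillar footprint in metres; z spans the full range
        max_points_in_voxel=20,                       # smaller is faster; 100 is likely overkill
        max_voxel_num=30000,                          # should cover most non-empty pillars
    )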

tianweiy avatar Apr 04 '21 14:04 tianweiy

Hello, I have one question about how the DataLoader works in the training script. I want to try dynamic voxelization, for which I need the points as input to the network instead of the voxels. The number of features I use for each point is 4. I checked that the points are also stored in the example dictionary, but somehow the dimension of the points changes after the data is passed through the DataLoader; specifically, the second dim goes from 4 to 5. Please look at this:

# print statement in datasets/pipelines/formatting.py, the last step in the pipeline for loading the data
# @PIPELINES.register_module
# class Reformat(object):
#    def __init__(self, **kwargs):
#    -----------------------------------
#    def __call__(self, res, info):
#       meta = res["metadata"]
#       points = res["lidar"]["points"]
#       voxels = res["lidar"]["voxels"]      
#       print("poins shape in Reformat: ", points.shape)

points shape in Reformat: (132250, 4)

# print statement in datasets/providentia.py, my own dataset wrapper, similar to waymo.py
#    def get_sensor_data(self, idx): 
#    .....................................................
#         data, _ = self.pipeline(res, info)
#         print("points shape in get_sensor_data: ",data["points"].shape)

points shape in get_sensor_data: (132250, 4)

# print statement in torchie/trainer/trainer.py
#    def train(self, data_loader, epoch, **kwargs):
#     ......................................................
#            for i, data_batch in enumerate(data_loader):
#                   print("points shape in data_batch: ", data_batch["points"].shape)

points shape in data_batch: torch.Size([132087, 5]) # here the dimensions change, from 4 to 5

# print statement in models/detectors/point_pillars.py
#    def forward(self, example, return_loss=True, **kwargs):
#    ..........................................................
#                 print("example points shape: ", example["points"].shape)

example points shape: torch.Size([132087, 5])

Do you have any idea why the dimensions change?

xavidzo avatar Apr 13 '21 11:04 xavidzo

Each point is padded with an index to indicate its location in the batch; see this:

https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/det3d/torchie/parallel/collate.py#L141

You can remove this index.
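
In other words, the collate step conceptually does something like the sketch below (`points_per_sample` is a hypothetical list of per-frame point arrays), so stripping the index back off inside the model is just a slice:

    import numpy as np

    # conceptually, collate prepends the sample index within the batch to every point
    points_per_sample = [np.random.rand(100, 4).astype(np.float32) for _ in range(2)]  # dummy frames
    batched = [np.pad(pts, ((0, 0), (1, 0)), mode="constant", constant_values=i)
               for i, pts in enumerate(points_per_sample)]
    points = np.concatenate(batched, axis=0)   # (sum_i N_i, 1 + num_features)

    # inside the network, drop the batch index again if you only need the raw point features
    raw_points = points[:, 1:]                 # (sum_i N_i, num_features)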

tianweiy avatar Apr 13 '21 18:04 tianweiy