PointPillars execution speed
I was able to get it up and running very easily compared to other implementations in the PointPillars family. It's great!! However, when I measured the execution speed, it was only about 1.5 fps. I measured the following call with time.time():

result_filter = model(batched_pts=[pc_torch], mode='test')[0]

According to the PointPillars paper, it runs at 42.4 fps in the PyTorch pipeline alone, without TensorRT. What should I do to get the same fps?
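Roughly, the measurement looks like the sketch below (simplified; `model` and `pc_torch` are set up by test.py, and a torch.cuda.synchronize() is included so the asynchronous GPU work is fully counted):

```python
import time
import torch

# Simplified timing sketch. `model` and `pc_torch` come from test.py (assumption);
# synchronize so the timer also covers GPU work that was launched asynchronously.
model.eval()
with torch.no_grad():
    tic = time.time()
    result_filter = model(batched_pts=[pc_torch], mode='test')[0]
    torch.cuda.synchronize()
    toc = time.time()
print(f'pred dur: {(toc - tic) * 1000:.1f} ms ({1.0 / (toc - tic):.1f} fps)')
```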
Good question.
But I remember the inference was fast.
I'll test the inference speed later and reply here.
Updates
- Using TITAN X (an older GPU) to evaluate: about 18 FPS
- Using TITAN X (an older GPU) to test: 71.5 ms / frame
- Using RTX 3090 to evaluate: about 42 FPS
- Using RTX 3090 to test: 1.6 s / frame

In conclusion:
- The average inference speed is 18 FPS on TITAN X or 42 FPS on RTX 3090 (testing with batch_size=1).
- The single-frame inference speed is 71.5 ms on TITAN X or 1.6 s on RTX 3090. The result on RTX 3090 is not reasonable, and I'm not sure why (~~One possible explanation is that other users are using shared resources on this machine~~ It may be related to the PyTorch version, see https://github.com/zhulf0804/PointPillars/issues/4#issuecomment-1223579731).
In addition, using fp16 can also accelerate the training/inference speed, but I didn't implement it in this repo.
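If someone wants to try it, a minimal sketch with PyTorch's automatic mixed precision could look like the following (untested for this repo; `model` and `pc_torch` stand for the objects built in test.py):

```python
import torch

# Hedged sketch: run inference under autocast so eligible ops execute in fp16.
# `model` and `pc_torch` are placeholders for the objects from test.py; fp16 is
# NOT implemented in this repo, so accuracy/speed would need to be re-verified.
model.eval()
with torch.no_grad(), torch.cuda.amp.autocast():
    result_filter = model(batched_pts=[pc_torch.cuda()], mode='test')[0]
```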
Thank you for experimenting. It's interesting that the older GPU is faster in the single-frame test. The original implementation, nutonomy/second.pytorch, is also written in PyTorch, so is it the GPU hardware that affects the single-frame inference speed rather than the network design? Or is it that libraries such as spconv are fast?
I'm not sure now.
I'll retest the speed on RTX 3090 when the machine is free.
Thank you very much. I will continue to investigate your project.
I attempted to check the execution time of the PillarEncoder function.
According to the result, nn.BatchNorm1d seems to take 500 ~ 600ms.
offset_pt_center_t: 0.000000 ms
offset_pi_center_t: 12.212276 ms
encoder_t: 0.000000 ms
mask_t: 1.309156 ms
features_t: 0.155210 ms
to_bn_t: 624.006748 ms
to_relu_t: 10.283470 ms
features_relu__t: 1.000643 ms
pooling_features: 0.000000 ms
scatter_t: 5.024433 ms
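For reference, per-op numbers like these can shift depending on when the GPU is synchronized, since CUDA kernels launch asynchronously. A simplified sketch of a per-stage measurement with explicit synchronization (the stage name below is taken from the log; the surrounding PillarEncoder code is abbreviated):

```python
import time
import torch

def timed(name, fn, *args, **kwargs):
    # Synchronize first so GPU work queued by earlier ops is not charged to this one.
    torch.cuda.synchronize()
    tic = time.time()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()  # wait until this op's kernels actually finish
    print(f'{name}: {(time.time() - tic) * 1000:.6f} ms')
    return out

# Example usage inside PillarEncoder.forward (sketch):
#     features = timed('to_bn_t', self.bn, features)
```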
Nice and careful analysis.
The time consumption of nn.BatchNorm1d seems strange.
Do you have any ideas about the result? BTW, what are your GPU and CUDA versions?
~~inference time logs (just for testing training/velodyne_reduced/000134.bin):~~
- ~~1.6s on RTX 3090 (before)~~
- ~~0.52s on RTX 3090 (4/8 gpu cards are being used by others).~~
- ~~0.50s on RTX 3090 (2/8 gpu cards are being used by others).~~
- ~~0.53s on RTX 3090 (0/8 gpu cards are being used by others).~~
CUDA is 10.1, and the GPU is a GTX 1080.
I use test.py, epoch_160.pth, and training/velodyne_reduced/000134.bin.
There was an error in my measurement. I'm sorry. The processing time for nn.BatchNorm1d was actually roughly 1 ms; it is nn.Conv1d whose processing time is long instead, about 500~600 ms.
Any ideas to speed this up?
features = F.relu(self.bn(self.conv(features))) # (p1 + p2 + ... + pb, out_channels, num_points)
Hello @cadwallader113,
I replaced PyTorch 1.8.1 with PyTorch 1.7.1, and the inference speed improved from 520 ms to 6 ms. Both were tested on an RTX 3090, so the speed may be related to the PyTorch version. You can try it.
Additionally, I found that the 1st run of nn.Conv1d is noticeably slower than the 2nd, 3rd, ... runs (see the warmup sketch after the timings below).
A clean test script is as follows:
import torch
import torch.nn as nn
import time

class M(nn.Module):
    def __init__(self, in_channel, out_channel):
        super().__init__()
        self.conv = nn.Conv1d(in_channel, out_channel, 1, bias=False)

    def forward(self, x):
        x = self.conv(x)
        return x

# Time 10 forward passes of the 1x1 Conv1d module.
m = M(9, 64).cuda()
for i in range(10):
    x = torch.randn(6169, 9, 32).float().cuda()
    tic = time.time()
    y = m(x)
    toc = time.time()
    print(f'iter {i} dur', toc - tic)
- With PyTorch 1.7.1 + cu110 on RTX 3090:
iter 0 dur 0.00656437873840332
iter 1 dur 0.0001876354217529297
iter 2 dur 0.00014901161193847656
iter 3 dur 0.00013113021850585938
iter 4 dur 0.00013136863708496094
iter 5 dur 0.0001246929168701172
iter 6 dur 0.00012302398681640625
iter 7 dur 0.00011420249938964844
iter 8 dur 0.00013327598571777344
iter 9 dur 0.00011563301086425781
- With PyTorch 1.8.1 + cu111 on RTX 3090:
iter 0 dur 0.5492613315582275
iter 1 dur 0.00022077560424804688
iter 2 dur 0.00012803077697753906
iter 3 dur 0.00010204315185546875
iter 4 dur 0.00010251998901367188
iter 5 dur 9.989738464355469e-05
iter 6 dur 9.942054748535156e-05
iter 7 dur 0.00010180473327636719
iter 8 dur 9.894371032714844e-05
iter 9 dur 9.965896606445312e-05
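Given that only iter 0 is slow, a warmup forward pass before the timed loop should keep this one-time cost out of the measurement. A sketch, reusing M from the script above:

```python
# Hedged sketch: warm up once so the one-time CUDA/cuDNN setup of nn.Conv1d
# is not counted in the timed iterations (M is the module defined above).
import torch

m = M(9, 64).cuda()
with torch.no_grad():
    _ = m(torch.randn(6169, 9, 32).float().cuda())  # warmup pass, result discarded
torch.cuda.synchronize()
# ... then run the timed loop from the script above ...
```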
Best regards.
Thanks for this good solution! I'll give it a try and share my results.
Thank you for the sample program.
I changed PyTorch and CUDA to the same versions as yours, but the difference was small...
However, like in your result, iterating resolved the conv1d bottleneck for me (only the first iteration is slow).
With PyTorch 1.7.1 + cu110 on GTX 1080:
iter 0 dur 0.4629683494567871
iter 1 dur 0.0
iter 2 dur 0.0
iter 3 dur 0.0
iter 4 dur 0.0010001659393310547
iter 5 dur 0.0
iter 6 dur 0.0
iter 7 dur 0.0
iter 8 dur 0.0010042190551757812
iter 9 dur 0.0
Could you tell me the fps when iterating over the input data in test.py on the RTX 3090?
I would like to know whether test.py also reaches 42 fps, like evaluate.py does.
https://github.com/zhulf0804/PointPillars/issues/4#issuecomment-1185110735
(i.e., case 4, "Using RTX 3090 to test")
Hi @cadwallader113,
I'll test the fps in the next few days.
Best.
Update
I tested with the following two steps:
- Load the data into all_val_data (a list):
val_dataset = Kitti(data_root='/mnt/ssd1/lifa_rdata/det/kitti_test',
                    split='val')
val_dataloader = get_dataloader(dataset=val_dataset,
                                batch_size=1,
                                num_workers=1,
                                shuffle=False)
all_val_data = []
for i, data_dict in enumerate(tqdm(val_dataloader)):
    if not args.no_cuda:
        # move the tensors to the cuda
        for key in data_dict:
            for j, item in enumerate(data_dict[key]):
                if torch.is_tensor(item):
                    data_dict[key][j] = data_dict[key][j].cuda()
    batched_pts = data_dict['batched_pts']
    all_val_data.append(batched_pts[0])
- Test iteratively:

for pc_torch in tqdm(all_val_data):
    # pc = read_points(args.pc_path)
    # pc = point_range_filter(pc)
    # pc_torch = torch.from_numpy(pc)
    if os.path.exists(args.calib_path):
        calib_info = read_calib(args.calib_path)
    else:
        calib_info = None
    if os.path.exists(args.gt_path):
        gt_label = read_label(args.gt_path)
    else:
        gt_label = None
    if os.path.exists(args.img_path):
        img = cv2.imread(args.img_path, 1)
    else:
        img = None
    model.eval()
    tic = time.time()
    with torch.no_grad():
        if not args.no_cuda:
            pc_torch = pc_torch.cuda()
        result_filter = model(batched_pts=[pc_torch],
                              mode='test')[0]
    if calib_info is not None and img is not None:
        tr_velo_to_cam = calib_info['Tr_velo_to_cam'].astype(np.float32)
        r0_rect = calib_info['R0_rect'].astype(np.float32)
        P2 = calib_info['P2'].astype(np.float32)
        image_shape = img.shape[:2]
        result_filter = keep_bbox_from_image_range(result_filter, tr_velo_to_cam, r0_rect, P2, image_shape)
    result_filter = keep_bbox_from_lidar_range(result_filter, pcd_limit_range)
    lidar_bboxes = result_filter['lidar_bboxes']
    labels, scores = result_filter['labels'], result_filter['scores']
    toc = time.time()
    # print('pred dur: ', toc - tic)
    # vis_pc(pc, bboxes=lidar_bboxes, labels=labels)
The result is shown in the following figure.

It's about 48 FPS.
Is this what you want?
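(By the way, if you want the FPS printed directly rather than read off the tqdm bar, the per-frame durations can be accumulated; a rough sketch, where run_one_frame is a hypothetical helper holding the body of the loop above:)

```python
import time

durs = []
for pc_torch in all_val_data:
    tic = time.time()
    run_one_frame(pc_torch)  # hypothetical helper: the loop body shown above
    durs.append(time.time() - tic)

print(f'average FPS over {len(durs)} frames: {len(durs) / sum(durs):.1f}')
```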
I also tested by iterating over 000134.bin, as follows.
for i in range(10):
    pc = read_points(args.pc_path)
    pc = point_range_filter(pc)
    pc_torch = torch.from_numpy(pc)
    if os.path.exists(args.calib_path):
        calib_info = read_calib(args.calib_path)
    else:
        calib_info = None
    if os.path.exists(args.gt_path):
        gt_label = read_label(args.gt_path)
    else:
        gt_label = None
    if os.path.exists(args.img_path):
        img = cv2.imread(args.img_path, 1)
    else:
        img = None
    model.eval()
    tic = time.time()
    with torch.no_grad():
        if not args.no_cuda:
            pc_torch = pc_torch.cuda()
        result_filter = model(batched_pts=[pc_torch],
                              mode='test')[0]
    if calib_info is not None and img is not None:
        tr_velo_to_cam = calib_info['Tr_velo_to_cam'].astype(np.float32)
        r0_rect = calib_info['R0_rect'].astype(np.float32)
        P2 = calib_info['P2'].astype(np.float32)
        image_shape = img.shape[:2]
        result_filter = keep_bbox_from_image_range(result_filter, tr_velo_to_cam, r0_rect, P2, image_shape)
    result_filter = keep_bbox_from_lidar_range(result_filter, pcd_limit_range)
    lidar_bboxes = result_filter['lidar_bboxes']
    labels, scores = result_filter['labels'], result_filter['scores']
    toc = time.time()
    print('pred dur: ', toc - tic)
    # vis_pc(pc, bboxes=lidar_bboxes, labels=labels)
The result is shown in the figure below.

It's about 45 FPS, including the time of loading data.
Thank you for your reply. Great result. In my tests, test.py ran at around 15 fps on the GTX 1080. So, if I want to go from 15 fps to 42 fps, I will have to change the graphics card. The last thing I don't understand is that, in the paper, a GTX 1080 Ti reaches 42 fps despite having lower performance than the RTX 3090. If you know anything about this, please let me know. I will close this issue. Thank you very much.