PointPillars

pointpillars execution speed

Open cadwallader113 opened this issue 3 years ago • 14 comments

I was able to get this running much more easily than the original PointPillars implementation. It's great! However, when I measured the execution speed, it was only about 1.5 fps. I timed the following call with time.time():

result_filter = model(batched_pts=[pc_torch], mode='test')[0]

According to the PointPillars paper, it runs at 42.4 fps in the PyTorch pipeline, without TensorRT. What should I do to get the same fps?

cadwallader113, Jul 14 '22 14:07

Good question.

But I remember inference being fast.

I'll test the inference speed later and reply here.

Updates

  1. Using TITAN X (an older GPU) to evaluate: about 18 FPS. (image: 0715_2)

  2. Using TITAN X (an older GPU) to test: 71.5 ms / frame. (image: 0715_4)

  3. Using RTX 3090 to evaluate: about 42 FPS. (image: 0715_1)

  4. Using RTX 3090 to test: 1.6 s / frame. (image: 0715_3)

In conclusion:

  • The average inference speed is 18 FPS on TITAN X or 42 FPS on RTX 3090 (tested with batch size 1).
  • The single-frame inference time is 71.5 ms on TITAN X or 1.6 s on RTX 3090. The RTX 3090 result is not reasonable, and I'm not sure why (~~One possible explanation is that other users are using shared resources on this machine~~ It may be related to the PyTorch version, see https://github.com/zhulf0804/PointPillars/issues/4#issuecomment-1223579731).
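To compare the FPS figures with the per-frame times quoted above, it helps to convert between the two. The sketch below only does that arithmetic on the numbers from this comment; the helper names are hypothetical and not part of the repo.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Frames per second if every frame took `latency_ms` milliseconds."""
    return 1000.0 / latency_ms


def fps_to_latency_ms(fps: float) -> float:
    """Milliseconds per frame at a steady rate of `fps` frames per second."""
    return 1000.0 / fps


# Single-frame times quoted in the comment above (test.py).
for label, ms in [("TITAN X, test", 71.5), ("RTX 3090, test", 1600.0)]:
    print(f"{label}: {ms:.1f} ms/frame -> {latency_ms_to_fps(ms):.2f} FPS")

# Average rates quoted in the comment above (evaluate.py).
for label, fps in [("TITAN X, evaluate", 18.0), ("RTX 3090, evaluate", 42.0)]:
    print(f"{label}: {fps:.0f} FPS -> {fps_to_latency_ms(fps):.1f} ms/frame")
```

Seen this way, 1.6 s per frame corresponds to well under 1 FPS, which is far from the 42 FPS measured over the whole evaluation set on the same GPU.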

In addition, using fp16 can also speed up training and inference, but I didn't implement it in this repo.

zhulf0804, Jul 15 '22 02:07

Thank you for experimenting. It's interesting that the older GPU is faster in the single-frame test. The original implementation, nutonomy/second.pytorch, is also written in PyTorch. Is it the GPU hardware, rather than the network design, that affects single-frame inference speed? Or is it that libraries such as spconv are fast?

cadwallader113, Jul 19 '22 10:07

I'm not sure now.

I'll retest the speed on RTX3090 when the machine is free.

zhulf0804, Jul 20 '22 02:07

Thank you very much. I will continue to investigate your project to some extent.

cadwallader113, Jul 20 '22 07:07

I measured the execution time of each step in PillarEncoder. According to the results, nn.BatchNorm1d seems to take 500 to 600 ms.

offset_pt_center_t: 0.000000 ms
offset_pi_center_t: 12.212276 ms
encoder_t: 0.000000 ms
mask_t: 1.309156 ms
features_t: 0.155210 ms
to_bn_t: 624.006748 ms
to_relu_t: 10.283470 ms
features_relu__t: 1.000643 ms
pooling_features: 0.000000 ms
scatter_t: 5.024433 ms
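A small sketch of how a per-stage log like the one above can be summarized to find the dominant stage. The values are copied from the log; the parsing helper is hypothetical and assumes one "name: value ms" entry per line.

```python
# Per-stage timings copied from the log above (milliseconds).
LOG = """\
offset_pt_center_t: 0.000000 ms
offset_pi_center_t: 12.212276 ms
encoder_t: 0.000000 ms
mask_t: 1.309156 ms
features_t: 0.155210 ms
to_bn_t: 624.006748 ms
to_relu_t: 10.283470 ms
features_relu__t: 1.000643 ms
pooling_features: 0.000000 ms
scatter_t: 5.024433 ms
"""


def parse_timings(text: str) -> dict:
    """Parse 'name: value ms' lines into a {name: milliseconds} dict."""
    timings = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        timings[name.strip()] = float(rest.split()[0])
    return timings


timings = parse_timings(LOG)
total = sum(timings.values())
slowest = max(timings, key=timings.get)
share = timings[slowest] / total

print(f"total: {total:.3f} ms")
print(f"slowest stage: {slowest} ({timings[slowest]:.3f} ms, {share:.1%} of total)")
```

A single stage accounting for almost all of the time usually means either a real bottleneck or a one-time cost being attributed to whichever call happens to trigger it, so it is worth re-measuring over several iterations before drawing conclusions.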

cadwallader113, Jul 20 '22 10:07

Nice and careful analysis.

The time consumed by nn.BatchNorm1d seems strange.

Do you have any ideas about the result? BTW, which GPU and CUDA version are you using?

zhulf0804, Jul 20 '22 11:07

~~inference time logs (just for testing training/velodyne_reduced/000134.bin):~~

  • ~~1.6s on RTX 3090 (before)~~
  • ~~0.52s on RTX 3090 (4/8 gpu cards are being used by others).~~
  • ~~0.50s on RTX 3090 (2/8 gpu cards are being used by others).~~
  • ~~0.53s on RTX 3090 (0/8 gpu cards are being used by others).~~

zhulf0804, Jul 24 '22 10:07

CUDA is 10.1 and the GPU is a GTX 1080. I use test.py, epoch_160.pth, and training/velodyne_reduced/000134.bin.

There was an error in my measurement, sorry. The processing time for nn.BatchNorm1d was roughly 1 ms. Instead, it is the conv1d call that takes a long time: 500 to 600 ms. Any ideas to speed this up?

features = F.relu(self.bn(self.conv(features))) # (p1 + p2 + ... + pb, out_channels, num_points)

cadwallader113, Aug 17 '22 02:08

Hello @cadwallader113,

I replaced PyTorch 1.8.1 with PyTorch 1.7.1, and the inference time dropped from 520 ms to 6 ms. Both were tested on an RTX 3090, so the speed may be related to the PyTorch version. You can try it.

Additionally, I found that the first call to nn.Conv1d is much slower than the second, third, and later calls.

A minimal reproduction is as follows:

import torch
import torch.nn as nn
import time


class M(nn.Module):
    def __init__(self, in_channel, out_channel):
        super().__init__()
        self.conv = nn.Conv1d(in_channel, out_channel, 1, bias=False)

    def forward(self, x):
        # Return the 1x1 convolution output.
        return self.conv(x)


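# Move the model to the GPU and time ten forward passes.
m = M(9, 64).cuda()
for i in range(10):
    x = torch.randn(6169, 9, 32).float().cuda()
    tic = time.time()
    y = m(x)
    toc = time.time()
    # CUDA kernels are launched asynchronously, so without torch.cuda.synchronize()
    # this mostly measures host-side launch time (plus any one-time setup on the first call).

    print(f'iter {i} dur', toc - tic)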
  1. With PyTorch 1.7.1 + cu110 on RTX 3090.
iter 0 dur 0.00656437873840332
iter 1 dur 0.0001876354217529297
iter 2 dur 0.00014901161193847656
iter 3 dur 0.00013113021850585938
iter 4 dur 0.00013136863708496094
iter 5 dur 0.0001246929168701172
iter 6 dur 0.00012302398681640625
iter 7 dur 0.00011420249938964844
iter 8 dur 0.00013327598571777344
iter 9 dur 0.00011563301086425781
  2. With PyTorch 1.8.1 + cu111 on RTX 3090.
iter 0 dur 0.5492613315582275
iter 1 dur 0.00022077560424804688
iter 2 dur 0.00012803077697753906
iter 3 dur 0.00010204315185546875
iter 4 dur 0.00010251998901367188
iter 5 dur 9.989738464355469e-05
iter 6 dur 9.942054748535156e-05
iter 7 dur 0.00010180473327636719
iter 8 dur 9.894371032714844e-05
iter 9 dur 9.965896606445312e-05

Best regards.

zhulf0804, Aug 23 '22 05:08

Thanks for this good solution! I'll give it a try and share my results.

cadwallader113, Aug 25 '22 08:08

Thank you for the sample program. I changed PyTorch and CUDA to the same versions as yours, but the difference was small. However, as you found, running more than one iteration resolved the conv1d bottleneck for me. With PyTorch 1.7.1 + cu110 on GTX 1080:

iter 0 dur 0.4629683494567871
iter 1 dur 0.0
iter 2 dur 0.0
iter 3 dur 0.0
iter 4 dur 0.0010001659393310547
iter 5 dur 0.0
iter 6 dur 0.0
iter 7 dur 0.0
iter 8 dur 0.0010042190551757812
iter 9 dur 0.0
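The readings of exactly 0.0 and roughly 0.001 s in the log above are consistent with a clock whose resolution is coarser than the call being timed. The post does not say which operating system was used, so this is an assumption. The sketch below just prints the resolution Python reports for two standard-library clocks on the current machine.

```python
import time

# Resolution (in seconds) that Python reports for each clock on this machine.
resolutions = {
    name: time.get_clock_info(name).resolution
    for name in ("time", "perf_counter")
}

for name, res in resolutions.items():
    print(f"time.{name}(): resolution {res:.9f} s")

# time.perf_counter() is the clock intended for measuring short intervals.
start = time.perf_counter()
sum(range(1000))  # tiny stand-in workload
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.9f} s")
```

If `time.time()` turns out to be coarse on a given platform, switching the measurement to `time.perf_counter()` avoids readings that collapse to zero.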

Could you tell me the fps when iterating over the input data in test.py on the RTX 3090? I would like to know whether test.py also reaches 42 fps, as evaluate.py does. I am referring to case 4, "Using RTX 3090 to test", in https://github.com/zhulf0804/PointPillars/issues/4#issuecomment-1185110735.

cadwallader113, Sep 02 '22 08:09

Hi @cadwallader113,

I'll test the fps in the next few days.

Best.

Update

I tested in the following two steps:

  1. Load the data into all_val_data (a list)
val_dataset = Kitti(data_root='/mnt/ssd1/lifa_rdata/det/kitti_test',
                        split='val')
val_dataloader = get_dataloader(dataset=val_dataset, 
                                batch_size=1, 
                                num_workers=1,
                                shuffle=False)
all_val_data = []
for i, data_dict in enumerate(tqdm(val_dataloader)):
    if not args.no_cuda:
        # move the tensors to the cuda
        for key in data_dict:
            for j, item in enumerate(data_dict[key]):
                if torch.is_tensor(item):
                    data_dict[key][j] = data_dict[key][j].cuda()

    batched_pts = data_dict['batched_pts']
    all_val_data.append(batched_pts[0])
  2. Test iteratively
for pc_torch in tqdm(all_val_data):
      # pc = read_points(args.pc_path)
      # pc = point_range_filter(pc)
      # pc_torch = torch.from_numpy(pc)
      if os.path.exists(args.calib_path):
          calib_info = read_calib(args.calib_path)
      else:
          calib_info = None

      if os.path.exists(args.gt_path):
          gt_label = read_label(args.gt_path)
      else:
          gt_label = None

      if os.path.exists(args.img_path):
          img = cv2.imread(args.img_path, 1)
      else:
          img = None

      model.eval()
      tic = time.time()
      with torch.no_grad():
          if not args.no_cuda:
              pc_torch = pc_torch.cuda()

          result_filter = model(batched_pts=[pc_torch], 
                              mode='test')[0]
      if calib_info is not None and img is not None:
          tr_velo_to_cam = calib_info['Tr_velo_to_cam'].astype(np.float32)
          r0_rect = calib_info['R0_rect'].astype(np.float32)
          P2 = calib_info['P2'].astype(np.float32)

          image_shape = img.shape[:2]
          result_filter = keep_bbox_from_image_range(result_filter, tr_velo_to_cam, r0_rect, P2, image_shape)

      result_filter = keep_bbox_from_lidar_range(result_filter, pcd_limit_range)
      lidar_bboxes = result_filter['lidar_bboxes']
      labels, scores = result_filter['labels'], result_filter['scores']
      toc = time.time()
      # print('pred dur: ', toc - tic)
      # vis_pc(pc, bboxes=lidar_bboxes, labels=labels)

The result is shown in the following figure (Screen Shot 2022-09-12 at 4 36 58 PM).

It's about 48 FPS.

Is this what you want?

zhulf0804, Sep 09 '22 02:09

I also tested iterating over 000134.bin as follows.

for i in range(10):
    pc = read_points(args.pc_path)
    pc = point_range_filter(pc)
    pc_torch = torch.from_numpy(pc)
    if os.path.exists(args.calib_path):
        calib_info = read_calib(args.calib_path)
    else:
        calib_info = None

    if os.path.exists(args.gt_path):
        gt_label = read_label(args.gt_path)
    else:
        gt_label = None

    if os.path.exists(args.img_path):
        img = cv2.imread(args.img_path, 1)
    else:
        img = None

    model.eval()
    tic = time.time()
    with torch.no_grad():
        if not args.no_cuda:
            pc_torch = pc_torch.cuda()

        result_filter = model(batched_pts=[pc_torch], 
                            mode='test')[0]
    if calib_info is not None and img is not None:
        tr_velo_to_cam = calib_info['Tr_velo_to_cam'].astype(np.float32)
        r0_rect = calib_info['R0_rect'].astype(np.float32)
        P2 = calib_info['P2'].astype(np.float32)

        image_shape = img.shape[:2]
        result_filter = keep_bbox_from_image_range(result_filter, tr_velo_to_cam, r0_rect, P2, image_shape)

    result_filter = keep_bbox_from_lidar_range(result_filter, pcd_limit_range)
    lidar_bboxes = result_filter['lidar_bboxes']
    labels, scores = result_filter['labels'], result_filter['scores']
    toc = time.time()
    print('pred dur: ', toc - tic)
    # vis_pc(pc, bboxes=lidar_bboxes, labels=labels)

The result is shown in the figure below (Screen Shot 2022-09-12 at 4 46 14 PM).

It's about 45 FPS, including the time of loading data.
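When per-frame durations are collected in a loop like the ones above, the overall FPS is the number of frames divided by the total elapsed time, which is not the same as averaging each frame's individual FPS. The sketch below illustrates the difference; the durations are hypothetical values chosen for illustration, not measurements from this thread.

```python
from typing import Sequence


def throughput_fps(durations_s: Sequence[float]) -> float:
    """Overall frames per second: frame count over total elapsed seconds."""
    return len(durations_s) / sum(durations_s)


def mean_of_per_frame_fps(durations_s: Sequence[float]) -> float:
    """Average of 1/duration per frame; overweights the fast frames."""
    return sum(1.0 / d for d in durations_s) / len(durations_s)


# Hypothetical per-frame durations in seconds (two fast frames, one slow).
durations = [0.02, 0.02, 0.06]

print(f"throughput: {throughput_fps(durations):.1f} FPS")
print(f"mean of per-frame FPS: {mean_of_per_frame_fps(durations):.1f} FPS")
```

The first figure is the one that matches what a progress bar such as tqdm reports as iterations per second over the whole run.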

zhulf0804, Sep 12 '22 08:09

Thank you for your reply. Great result. In my tests, test.py ran at around 15 fps on a GTX 1080. So, to go from 15 fps to 42 fps, I would have to change the graphics card. The last thing I don't understand is that in the paper, a GTX 1080 Ti reaches 42 fps despite being slower than the 3090. If you know anything, please let me know. I will close this issue. Thank you very much.

cadwallader113, Sep 14 '22 06:09