PointPillars execution speed
I was able to get it up and running very easily compared to other implementations in the PointPillars family. It's great!! However, when I measured the execution speed, it was only about 1.5 fps. I measured the following call with time.time():

result_filter = model(batched_pts=[pc_torch], mode='test')[0]

According to the PointPillars paper, it runs at 42.4 fps in the PyTorch pipeline alone, without TensorRT. What should I do to get the same fps?
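Roughly, the measurement looks like the sketch below (simplified; `model` and `pc_torch` are set up by test.py, and a torch.cuda.synchronize() is included so the asynchronous GPU work is fully counted):

```python
import time
import torch

# Simplified timing sketch. `model` and `pc_torch` come from test.py (assumption);
# synchronize so the timer also covers GPU work that was launched asynchronously.
model.eval()
with torch.no_grad():
    tic = time.time()
    result_filter = model(batched_pts=[pc_torch], mode='test')[0]
    torch.cuda.synchronize()
    toc = time.time()
print(f'pred dur: {(toc - tic) * 1000:.1f} ms ({1.0 / (toc - tic):.1f} fps)')
```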
Good question.
But I remember the inference was fast.
I'll test the inference speed later and reply here.
Updates
- Using TITAN X (an older GPU) to evaluate: about 18 FPS
- Using TITAN X (an older GPU) to test: 71.5 ms / frame
- Using RTX 3090 to evaluate: about 42 FPS
- Using RTX 3090 to test: 1.6 s / frame

In conclusion:
- The average inference speed is 18 FPS on TITAN X or 42 FPS on RTX 3090 (testing with batch_size=1).
- The single-frame inference speed is 71.5 ms on TITAN X or 1.6 s on RTX 3090. The result on RTX 3090 is not reasonable, and I'm not sure why (~~One possible explanation is that other users are using shared resources on this machine~~ It may be related to the PyTorch version, see https://github.com/zhulf0804/PointPillars/issues/4#issuecomment-1223579731).
In addition, using fp16 can also accelerate the training/inference speed, but I didn't implement it in this repo.
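If someone wants to try it, a minimal sketch with PyTorch's automatic mixed precision could look like the following (untested for this repo; `model` and `pc_torch` stand for the objects built in test.py):

```python
import torch

# Hedged sketch: run inference under autocast so eligible ops execute in fp16.
# `model` and `pc_torch` are placeholders for the objects from test.py; fp16 is
# NOT implemented in this repo, so accuracy/speed would need to be re-verified.
model.eval()
with torch.no_grad(), torch.cuda.amp.autocast():
    result_filter = model(batched_pts=[pc_torch.cuda()], mode='test')[0]
```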
Thank you for experimenting. It's interesting that the older GPU is faster in the single-frame test. The original implementation, nutonomy/second.pytorch, is also written in PyTorch, so is it the GPU hardware that affects the single-frame inference speed rather than the network design? Or is it that libraries such as spconv are fast?
I'm not sure now.
I'll retest the speed on RTX 3090 when the machine is free.
Thank you very much. I will continue to investigate your project.
I attempted to check the execution time of the PillarEncoder function.
According to the result, nn.BatchNorm1d seems to take 500 ~ 600ms.
offset_pt_center_t: 0.000000 ms
offset_pi_center_t: 12.212276 ms
encoder_t: 0.000000 ms
mask_t: 1.309156 ms
features_t: 0.155210 ms
to_bn_t: 624.006748 ms
to_relu_t: 10.283470 ms
features_relu__t: 1.000643 ms
pooling_features: 0.000000 ms
scatter_t: 5.024433 ms
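For reference, per-op numbers like these can shift depending on when the GPU is synchronized, since CUDA kernels launch asynchronously. A simplified sketch of a per-stage measurement with explicit synchronization (the stage name below is taken from the log; the surrounding PillarEncoder code is abbreviated):

```python
import time
import torch

def timed(name, fn, *args, **kwargs):
    # Synchronize first so GPU work queued by earlier ops is not charged to this one.
    torch.cuda.synchronize()
    tic = time.time()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()  # wait until this op's kernels actually finish
    print(f'{name}: {(time.time() - tic) * 1000:.6f} ms')
    return out

# Example usage inside PillarEncoder.forward (sketch):
#     features = timed('to_bn_t', self.bn, features)
```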
Nice and careful analysis.
The time consumption of nn.BatchNorm1d seems strange.
Do you have any ideas about the result? BTW, what are your GPU and CUDA versions?
~~inference time logs (just for testing training/velodyne_reduced/000134.bin):~~
- ~~1.6s on RTX 3090 (before)~~
- ~~0.52s on RTX 3090 (4/8 gpu cards are being used by others).~~
- ~~0.50s on RTX 3090 (2/8 gpu cards are being used by others).~~
- ~~0.53s on RTX 3090 (0/8 gpu cards are being used by others).~~
CUDA is 10.1, and the GPU is a GTX 1080.
I use test.py, epoch_160.pth, and training/velodyne_reduced/000134.bin.
There was an error in my measurement. I'm sorry. The processing time for nn.BatchNorm1d was actually roughly 1 ms; it is nn.Conv1d whose processing time is long instead, about 500~600 ms.
Any ideas to speed this up?
features = F.relu(self.bn(self.conv(features))) # (p1 + p2 + ... + pb, out_channels, num_points)
Hello @cadwallader113,
I replaced PyTorch 1.8.1 with PyTorch 1.7.1, and the inference speed improved from 520 ms to 6 ms. Both were tested on an RTX 3090, so the speed may be related to the PyTorch version. You can try it.
Additionally, I found that the 1st run of nn.Conv1d is noticeably slower than the 2nd, 3rd, ... runs (see the warmup sketch after the timings below).
A clean test script is as follows:
import torch
import torch.nn as nn
import time

class M(nn.Module):
    def __init__(self, in_channel, out_channel):
        super().__init__()
        self.conv = nn.Conv1d(in_channel, out_channel, 1, bias=False)

    def forward(self, x):
        x = self.conv(x)
        return x

# Time 10 forward passes of the 1x1 Conv1d module.
m = M(9, 64).cuda()
for i in range(10):
    x = torch.randn(6169, 9, 32).float().cuda()
    tic = time.time()
    y = m(x)
    toc = time.time()
    print(f'iter {i} dur', toc - tic)
- With PyTorch 1.7.1 + cu110 on RTX 3090:
iter 0 dur 0.00656437873840332
iter 1 dur 0.0001876354217529297
iter 2 dur 0.00014901161193847656
iter 3 dur 0.00013113021850585938
iter 4 dur 0.00013136863708496094
iter 5 dur 0.0001246929168701172
iter 6 dur 0.00012302398681640625
iter 7 dur 0.00011420249938964844
iter 8 dur 0.00013327598571777344
iter 9 dur 0.00011563301086425781
- With PyTorch 1.8.1 + cu111 on RTX 3090:
iter 0 dur 0.5492613315582275
iter 1 dur 0.00022077560424804688
iter 2 dur 0.00012803077697753906
iter 3 dur 0.00010204315185546875
iter 4 dur 0.00010251998901367188
iter 5 dur 9.989738464355469e-05
iter 6 dur 9.942054748535156e-05
iter 7 dur 0.00010180473327636719
iter 8 dur 9.894371032714844e-05
iter 9 dur 9.965896606445312e-05
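Given that only iter 0 is slow, a warmup forward pass before the timed loop should keep this one-time cost out of the measurement. A sketch, reusing M from the script above:

```python
# Hedged sketch: warm up once so the one-time CUDA/cuDNN setup of nn.Conv1d
# is not counted in the timed iterations (M is the module defined above).
import torch

m = M(9, 64).cuda()
with torch.no_grad():
    _ = m(torch.randn(6169, 9, 32).float().cuda())  # warmup pass, result discarded
torch.cuda.synchronize()
# ... then run the timed loop from the script above ...
```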
Best regards.
Thanks for this good solution! I'll give it a try and share my results.
Thank you for the sample program.
I changed PyTorch and CUDA to the same versions as yours, but the difference was small...
However, like in your result, iterating resolved the conv1d bottleneck for me (only the first iteration is slow).
With PyTorch 1.7.1 + cu110 on GTX 1080:
iter 0 dur 0.4629683494567871
iter 1 dur 0.0
iter 2 dur 0.0
iter 3 dur 0.0
iter 4 dur 0.0010001659393310547
iter 5 dur 0.0
iter 6 dur 0.0
iter 7 dur 0.0
iter 8 dur 0.0010042190551757812
iter 9 dur 0.0
Could you tell me the fps when iterating over the input data in test.py on the RTX 3090?
I would like to know whether test.py also reaches 42 fps, like evaluate.py does.
https://github.com/zhulf0804/PointPillars/issues/4#issuecomment-1185110735
(i.e., case 4, "Using RTX 3090 to test")
Hi @cadwallader113,
I'll test the fps in the next few days.
Best.
Update
I tested with the following two steps:
- Load the data into all_val_data (a list):
val_dataset = Kitti(data_root='/mnt/ssd1/lifa_rdata/det/kitti_test',
                    split='val')
val_dataloader = get_dataloader(dataset=val_dataset,
                                batch_size=1,
                                num_workers=1,
                                shuffle=False)
all_val_data = []
for i, data_dict in enumerate(tqdm(val_dataloader)):
    if not args.no_cuda:
        # move the tensors to the cuda
        for key in data_dict:
            for j, item in enumerate(data_dict[key]):
                if torch.is_tensor(item):
                    data_dict[key][j] = data_dict[key][j].cuda()
    batched_pts = data_dict['batched_pts']
    all_val_data.append(batched_pts[0])
- Test iteratively:

for pc_torch in tqdm(all_val_data):
    # pc = read_points(args.pc_path)
    # pc = point_range_filter(pc)
    # pc_torch = torch.from_numpy(pc)
    if os.path.exists(args.calib_path):
        calib_info = read_calib(args.calib_path)
    else:
        calib_info = None
    if os.path.exists(args.gt_path):
        gt_label = read_label(args.gt_path)
    else:
        gt_label = None
    if os.path.exists(args.img_path):
        img = cv2.imread(args.img_path, 1)
    else:
        img = None
    model.eval()
    tic = time.time()
    with torch.no_grad():
        if not args.no_cuda:
            pc_torch = pc_torch.cuda()
        result_filter = model(batched_pts=[pc_torch],
                              mode='test')[0]
    if calib_info is not None and img is not None:
        tr_velo_to_cam = calib_info['Tr_velo_to_cam'].astype(np.float32)
        r0_rect = calib_info['R0_rect'].astype(np.float32)
        P2 = calib_info['P2'].astype(np.float32)
        image_shape = img.shape[:2]
        result_filter = keep_bbox_from_image_range(result_filter, tr_velo_to_cam, r0_rect, P2, image_shape)
    result_filter = keep_bbox_from_lidar_range(result_filter, pcd_limit_range)
    lidar_bboxes = result_filter['lidar_bboxes']
    labels, scores = result_filter['labels'], result_filter['scores']
    toc = time.time()
    # print('pred dur: ', toc - tic)
    # vis_pc(pc, bboxes=lidar_bboxes, labels=labels)
The result is shown in the following figure.

It's about 48 FPS.
Is this what you want?
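(By the way, if you want the FPS printed directly rather than read off the tqdm bar, the per-frame durations can be accumulated; a rough sketch, where run_one_frame is a hypothetical helper holding the body of the loop above:)

```python
import time

durs = []
for pc_torch in all_val_data:
    tic = time.time()
    run_one_frame(pc_torch)  # hypothetical helper: the loop body shown above
    durs.append(time.time() - tic)

print(f'average FPS over {len(durs)} frames: {len(durs) / sum(durs):.1f}')
```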
I also tested by iterating over 000134.bin, as follows.
for i in range(10):
    pc = read_points(args.pc_path)
    pc = point_range_filter(pc)
    pc_torch = torch.from_numpy(pc)
    if os.path.exists(args.calib_path):
        calib_info = read_calib(args.calib_path)
    else:
        calib_info = None
    if os.path.exists(args.gt_path):
        gt_label = read_label(args.gt_path)
    else:
        gt_label = None
    if os.path.exists(args.img_path):
        img = cv2.imread(args.img_path, 1)
    else:
        img = None
    model.eval()
    tic = time.time()
    with torch.no_grad():
        if not args.no_cuda:
            pc_torch = pc_torch.cuda()
        result_filter = model(batched_pts=[pc_torch],
                              mode='test')[0]
    if calib_info is not None and img is not None:
        tr_velo_to_cam = calib_info['Tr_velo_to_cam'].astype(np.float32)
        r0_rect = calib_info['R0_rect'].astype(np.float32)
        P2 = calib_info['P2'].astype(np.float32)
        image_shape = img.shape[:2]
        result_filter = keep_bbox_from_image_range(result_filter, tr_velo_to_cam, r0_rect, P2, image_shape)
    result_filter = keep_bbox_from_lidar_range(result_filter, pcd_limit_range)
    lidar_bboxes = result_filter['lidar_bboxes']
    labels, scores = result_filter['labels'], result_filter['scores']
    toc = time.time()
    print('pred dur: ', toc - tic)
    # vis_pc(pc, bboxes=lidar_bboxes, labels=labels)
The result is shown in the figure below.

It's about 45 FPS, including the time of loading data.
Thank you for your reply. Great result. In my tests, test.py ran at around 15 fps on the GTX 1080. So, if I want to go from 15 fps to 42 fps, I will have to change the graphics card. The last thing I don't understand is that, in the paper, a GTX 1080 Ti reaches 42 fps despite having lower performance than the RTX 3090. If you know anything about this, please let me know. I will close this issue. Thank you very much.