KPConv-PyTorch

How to speed up the model inference.

Trexzhou opened this issue 2 years ago · 10 comments

Hi, @HuguesTHOMAS ! Thank you for your excellent work!

I'm trying to measure the inference time of the model with the following code: [screenshot]

I want to reduce the inference time of the model, and I have tried the following approaches:

  1. Modify the network structure by rewriting the "architecture" list in class Config(). I found that the most effective way to reduce inference time was to remove layers containing the "KPConv" operation. With an architecture like [screenshot], the inference time dropped from 20 ms to 11 ms (a sketch of such a reduced config is shown after this list).
  2. Reduce "num_kernel_points" in class Config(). However, I found that this does not lead to a significant reduction in inference time. I thought reducing the number of kernel points would reduce the number of convolution operations and therefore the time cost; am I right? Please correct me if not :)
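
For illustration, here is a minimal sketch of what a reduced architecture list in class Config() could look like. The layer names follow the convention of the example configs in KPConv-PyTorch, but the exact list from the screenshot is not recoverable, so treat this as an assumption rather than the tested config:

```python
class Config:
    # Illustrative only: fewer KPConv-based blocks ('simple', 'resnetb*')
    # than the default example configs, which is what cuts per-forward
    # convolution work.
    architecture = ['simple',
                    'resnetb_strided',
                    'resnetb_strided',
                    'nearest_upsample',
                    'unary',
                    'nearest_upsample',
                    'unary']

    # Kernel size discussed below; reducing it had little effect here.
    num_kernel_points = 15
```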

I will also share the results of your excellent work on my own dataset, it's amazing!! [screenshot]

Trexzhou · Jul 06 '22 12:07

What parameter config have you used? (radius and first_subsampling_dl)

working12 · Jul 06 '22 18:07

@working12 hi,

in_radius = 6.0
val_radius = 51.0
first_subsampling_dl = 0.25

Trexzhou · Jul 08 '22 01:07

Hey my friend! Do not forget torch.cuda.synchronize() when you test inference time!

aoligei178 · Jul 09 '22 01:07

Hi @Trexzhou,

Thanks for your message, results look really nice. As @working12 mentioned, the val_radius and first_subsampling_dl are two factors that have a high impact on inference time. In your case, I guess you want to classify a whole lidar frame (or consecutive frames merged together) so it does not make sense to reduce the val_radius. However, you can try to increase the first_subsampling_dl parameter to reduce the number of points and thus the inference time. This could also impact the performance so you will have to find the tradeoff between speed and performance.

About num_kernel_points: it is already set to a very low value (15 is nearly half of the 27 points a 3x3x3 grid kernel would have). I would not touch that.

Do not forget torch.cuda.synchronize() when you test inference time!

Indeed, this is important when you measure times on GPU.
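
For reference, here is a minimal timing sketch. The net(batch, config) call is an assumption that mirrors how the forward pass is invoked in this repository's test scripts; adapt it to your setup:

```python
import time
import torch

# Warm-up runs so one-time CUDA initialization does not pollute the timing.
for _ in range(10):
    _ = net(batch, config)  # assumed forward call

# CUDA launches are asynchronous: without synchronize(), the timer mostly
# measures kernel launch overhead instead of the actual GPU compute.
torch.cuda.synchronize()
t0 = time.time()
outputs = net(batch, config)
torch.cuda.synchronize()
print('inference time: {:.1f} ms'.format(1000 * (time.time() - t0)))
```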

Also, one last remark: do not forget that a large part of the inference time (if you were to use this on a real robot/car) comes from the CPU preprocessing. I am working on optimizing that, but right now the code is optimized to process large point clouds of whole areas in parallel, which is very different from processing a small lidar frame as fast as possible.

HuguesTHOMAS · Jul 11 '22 14:07

Hey my friend! Do not forget torch.cuda.synchronize() when you test inference time!

Thank you my friend! :)

Trexzhou · Aug 01 '22 01:08

Hello Thomas,

Thank you for your response and suggestions! Meanwhile, I am trying to understand the function cpp_neighbors.batch_query() that is called in batch_neighbors(queries, supports, q_batches, s_batches, radius). What do query points and support points mean? Could you give me more details so that I can understand the inputs and outputs? Thanks again for your excellent work!

Trexzhou · Aug 01 '22 01:08

Hello Thomas,

As I mentioned before, I was trying to understand what cpp_neighbors.batch_query() does. I tried to inspect its output by constructing a fake input (as shown in the picture below: fake_points_A contains points whose y and z coordinates are 0.0, and fake_points_B contains points whose y coordinate is 1.0 and z coordinate is 0.0). [screenshot]

and I got this output: [screenshot]. Some questions:

  1. Is the way I constructed the fake input correct?
  2. What does the output mean? How should I interpret it?

Trexzhou · Aug 01 '22 07:08

Hi @Trexzhou,

First, here is an answer to your first question.

Thank you for your response and suggestions! Meanwhile, I am trying to understand the function cpp_neighbors.batch_query() that is called in batch_neighbors(queries, supports, q_batches, s_batches, radius). What do query points and support points mean? Could you give me more details so that I can understand the inputs and outputs? Thanks again for your excellent work!

I just answered the same question here: https://github.com/HuguesTHOMAS/KPConv-PyTorch/issues/191#issuecomment-1201729232
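
To make the query/support distinction concrete, here is a hedged usage sketch of the batch_neighbors() wrapper. Only the signature comes from your message; the shapes and values below are invented for illustration:

```python
import numpy as np

# Query points: the points that ask "who are my neighbors?"
queries = np.random.rand(100, 3).astype(np.float32)
# Support points: the pool in which neighbors are searched.
supports = np.random.rand(200, 3).astype(np.float32)
# Stacked-batch lengths: queries holds two clouds (60 + 40 points),
# supports holds the two matching clouds (120 + 80 points).
q_batches = np.array([60, 40], dtype=np.int32)
s_batches = np.array([120, 80], dtype=np.int32)

neighbors = batch_neighbors(queries, supports, q_batches, s_batches, 0.5)
# neighbors has shape (100, max_neighbor_count); entries equal to
# len(supports) (= 200 here) are "shadow" indices meaning "no neighbor".
```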

HuguesTHOMAS · Aug 01 '22 21:08

Now about your test. Here is a small diagram of your setup:

A = .
B = +
r = 2  <=>  |-------|

cloud1
----0---1---2---3---4---5---6---7---8---9---10--
|
1   .   .   .   .   .   .                    
|
2                   +                        

cloud2
----0---1---2---3---4---5---6---7---8---9---10--
|
1                                           .
|
2                                       +   +

Here is what the cpp_neighbors.batch_query() function does: it finds the neighbors of the points in batch A (query) among the points of batch B (support), within a radius of 2. For each point of cloud1 in batch A, the neighbors are:

[]
[]
[]
[0]
[0]
[0]

The only point of cloud2 in batch A has 2 neighbors:

[1, 0]

Now we stack these results, being careful to offset the neighbors of cloud2 so they still point to the right indices:

A             Neighbs             B
[ 0, 0, 0]    []                  [ 4, 1, 0] 
[ 1, 0, 0]    []                  [ 9, 1, 0]
[ 2, 0, 0]    []                  [10, 1, 0]
[ 3, 0, 0]    [0]
[ 4, 0, 0]    [0]
[ 5, 0, 0]    [0]
[10, 0, 0]    [2, 1]

See how the neighbors from cloud2 in A are offset by 1, which is the length of cloud1 in B. This way, in the full stacked batch, the neighbor indices of the two point clouds do not overlap.

Eventually, we add shadow neighbors to build a regular matrix that the GPU can use. The shadow value here is 3 because it is the length of batch B.

[3, 3]
[3, 3]
[3, 3]
[0, 3]
[0, 3]
[0, 3]
[2, 1]
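
As a sanity check, the toy example above can be written as a call to the batch_neighbors() wrapper (a sketch; the expected matrix is exactly the one just shown):

```python
import numpy as np

# Batch A (queries): cloud1 = six points along x, cloud2 = one point.
queries = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0],
                    [3, 0, 0], [4, 0, 0], [5, 0, 0],
                    [10, 0, 0]], dtype=np.float32)
# Batch B (supports): cloud1 = one point, cloud2 = two points.
supports = np.array([[4, 1, 0],
                     [9, 1, 0], [10, 1, 0]], dtype=np.float32)
q_batches = np.array([6, 1], dtype=np.int32)
s_batches = np.array([1, 2], dtype=np.int32)

neighbors = batch_neighbors(queries, supports, q_batches, s_batches, 2.0)
# Expected output, with shadow index len(supports) = 3:
# [[3 3]
#  [3 3]
#  [3 3]
#  [0 3]
#  [0 3]
#  [0 3]
#  [2 1]]
```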

HuguesTHOMAS · Aug 01 '22 21:08

Hi @HuguesTHOMAS ,

Wow, that's cool! Many thanks for your amazing explanation, it made me completely understand what the function does!

I will try my best to get more excellent results, and I will share them with you :)

Thanks a lot for your patient and professional reply!

Trexzhou · Aug 03 '22 02:08