KPConv-PyTorch
How to speed up model inference
Hi @HuguesTHOMAS! Thank you for your excellent work!
I'm trying to measure the inference time of the model with a timing loop:
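(The original snippet is not shown here; below is a minimal sketch of such a timing loop, where `net`, `batch`, and `config` are placeholders for the model, a preprocessed input batch, and the config object, following the call pattern used in the repo's test script.)

```python
import time
import torch

net.eval()
with torch.no_grad():
    # Warm-up iterations so CUDA kernels are loaded and cached before timing
    for _ in range(10):
        _ = net(batch, config)

    torch.cuda.synchronize()  # flush pending GPU work before starting the clock
    t0 = time.time()
    for _ in range(100):
        _ = net(batch, config)
    torch.cuda.synchronize()  # make sure the GPU has finished before stopping
    t1 = time.time()

print(f"mean inference time: {(t1 - t0) / 100 * 1000:.1f} ms")
```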
I want to reduce the inference time of the model, and I have tried the following:
- Modifying the network structure by rewriting the `architecture` list in `class Config()`. I found that the most effective way to reduce inference time was to remove the layers containing the KPConv operation. I tried an architecture along the lines of the sketch shown after this list, and the inference time dropped from 20 ms to 11 ms.
- Reduce "num_kernel_points" in in class Config(), but I find that this will not lead to a significant reduction in time-cost of inference. I think reducing the number of kernels will reduce the number of convolution operations, so that we can reduce the time-cost, am I right ? Would you please to correct me : )
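For reference, here is a hypothetical reduced architecture using the block names from the repo's config files (this is only an illustration, not the exact list tried above):

```python
# Hypothetical config sketch: blocks such as 'simple' and 'resnetb' contain the
# KPConv operation; removing or replacing stages like this shortens the network.
architecture = ['simple',
                'resnetb_strided',
                'resnetb',
                'resnetb_strided',
                'resnetb',
                'nearest_upsample',
                'unary',
                'nearest_upsample',
                'unary']
```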
I will share the results of applying your excellent work to my own dataset; it's amazing!!
What parameter config have you used? (`radius` and `first_subsampling_dl`)
@working12 hi, `in_radius = 6.0`, `val_radius = 51.0`, `first_subsampling_dl = 0.25`
Hey my friend! Do not forget `torch.cuda.synchronize()` when you test inference time!
Hi @Trexzhou,
Thanks for your message, the results look really nice. As @working12 mentioned, `val_radius` and `first_subsampling_dl` are two factors that have a high impact on inference time. In your case, I guess you want to classify a whole lidar frame (or consecutive frames merged together), so it does not make sense to reduce `val_radius`. However, you can try to increase the `first_subsampling_dl` parameter to reduce the number of points and thus the inference time. This could also impact performance, so you will have to find the tradeoff between speed and performance.
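For example, a hypothetical config change (the value here is illustrative only):

```python
# Illustrative value only. A coarser first subsampling grid means fewer points
# at every layer of the network, hence faster inference but potentially lower accuracy.
first_subsampling_dl = 0.40  # e.g. increased from the 0.25 used above
```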
About `num_kernel_points`: it is already set to a very low value (15 is nearly 2 times fewer points than the 27 of a 3x3x3 grid kernel). I would not touch that.
> Do not forget `torch.cuda.synchronize()` when you test inference time!

Indeed, this is important when you measure times on the GPU.
Also, one last remark: do not forget that a large part of the inference time (if you were to use this on a real robot/car) comes from the CPU preprocessing. I am working on optimizing that, but right now the code is optimized to process large point clouds of whole areas in parallel, which is very different from processing a small lidar frame as fast as possible.
> Hey my friend! Do not forget `torch.cuda.synchronize()` when you test inference time!

Thank you my friend! :)
Hello Thomas,
Thank you for your response and suggestions! Meanwhile, I am trying to understand the function `cpp_neighbors.batch_query()` that is called in `batch_neighbors(queries, supports, q_batches, s_batches, radius)`. What do "query points" and "support points" mean? Could you give me more details so that I can understand what the input and output are? Thanks again for your excellent work!
Hello Thomas,
As I mentioned before, I was trying to understand what `cpp_neighbors.batch_query()` does. I tried to inspect its output by building a fake input (as shown in the picture below: `fake_points_A` contains the points whose y and z coordinates are 0.0, and `fake_points_B` contains the points whose y coordinate is 1.0 and z coordinate is 0.0; a reconstruction is sketched after my questions), and I got this output:
Some questions:
- Is the way I made the fake input correct?
- What does the output mean? How should I read it?
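For reference, here is a sketch of how the fake input described above could be built and queried, assuming the `batch_neighbors` wrapper from `datasets/common.py` (the coordinates are inferred from the reply below, not copied from the original screenshot):

```python
import numpy as np
from datasets.common import batch_neighbors  # wrapper around cpp_neighbors.batch_query()

# Queries (A): points with y = z = 0, split into two clouds.
fake_points_A = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0],
                          [3, 0, 0], [4, 0, 0], [5, 0, 0],   # cloud1
                          [10, 0, 0]], dtype=np.float32)     # cloud2
q_batches = np.array([6, 1], dtype=np.int32)  # 6 query points in cloud1, 1 in cloud2

# Supports (B): points with y = 1 and z = 0, split the same way.
fake_points_B = np.array([[4, 1, 0],                                 # cloud1
                          [9, 1, 0], [10, 1, 0]], dtype=np.float32)  # cloud2
s_batches = np.array([1, 2], dtype=np.int32)  # 1 support point in cloud1, 2 in cloud2

# For each query point, find the indices of the support points within the radius.
neighbors = batch_neighbors(fake_points_A, fake_points_B, q_batches, s_batches, 2.0)
print(neighbors)
```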
Hi @Trexzhou,
First, here is an answer to your first question.
> Thank you for your response and suggestions! Meanwhile, I am trying to understand the function `cpp_neighbors.batch_query()` that is called in `batch_neighbors(queries, supports, q_batches, s_batches, radius)`. What do "query points" and "support points" mean? Could you give me more details so that I can understand what the input and output are? Thanks again for your excellent work!
I just answered the same question here: https://github.com/HuguesTHOMAS/KPConv-PyTorch/issues/191#issuecomment-1201729232
Now about your test. Here is a small scheme of your problem:
```
A = .
B = +
r = 2  <=>  |-------|

cloud1
----0---1---2---3---4---5---6---7---8---9---10--
 |
 1  .   .   .   .   .   .
 |
 2                  +

cloud2
----0---1---2---3---4---5---6---7---8---9---10--
 |
 1                                          .
 |
 2                                      +   +
```
Here is what the `cpp_neighbors.batch_query()` function does: it finds the neighbors of the points of batch A (the queries) among the points of batch B (the supports), within a radius of 2. For each point of cloud1 in batch A, the neighbors are:
```
[]
[]
[]
[0]
[0]
[0]
```
The only point of cloud2 in batch A has 2 neighbors:
```
[1, 0]
```
Now we stack these results, being careful to offset the neighbors of cloud2 so they still point to the right indices:
```
     A       Neighbs        B
[ 0, 0, 0]   []        [ 4, 1, 0]
[ 1, 0, 0]   []        [ 9, 1, 0]
[ 2, 0, 0]   []        [10, 1, 0]
[ 3, 0, 0]   [0]
[ 4, 0, 0]   [0]
[ 5, 0, 0]   [0]
[10, 0, 0]   [2, 1]
```
See how the neighbors from cloud2 in A are offset by 1, which is the length of cloud1 in B. In the full stacked batch, the neighbor indices of the two point clouds therefore do not overlap.
Eventually, we add the shadow neighbors to make a nice rectangular matrix that can be used on the GPU. The shadow value here is 3 because it is the length of batch B.
```
[3, 3]
[3, 3]
[3, 3]
[0, 3]
[0, 3]
[0, 3]
[2, 1]
```
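To make this concrete, here is a small numpy sketch that mimics the logic above on this exact example (an illustrative re-implementation, not the actual C++ code; the real function appears to order neighbors by distance, which this sketch does not):

```python
import numpy as np

# Pure-numpy sketch of the logic described above.
def batch_query_sketch(queries, supports, q_batches, s_batches, radius):
    n_support = len(supports)  # shadow index = total number of support points
    all_neighb = []
    q_start, s_start = 0, 0
    for nq, ns in zip(q_batches, s_batches):
        q = queries[q_start:q_start + nq]
        s = supports[s_start:s_start + ns]
        # Pairwise distances between this cloud's queries and supports
        d = np.linalg.norm(q[:, None, :] - s[None, :, :], axis=-1)
        for qi in range(nq):
            # Offset local indices by s_start so they index the stacked supports
            all_neighb.append(np.nonzero(d[qi] < radius)[0] + s_start)
        q_start += nq
        s_start += ns
    # Pad every row with the shadow index to get a rectangular matrix
    max_n = max(len(n) for n in all_neighb)
    out = np.full((len(queries), max_n), n_support, dtype=np.int64)
    for i, n in enumerate(all_neighb):
        out[i, :len(n)] = n
    return out

A = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0],
              [4, 0, 0], [5, 0, 0], [10, 0, 0]], dtype=np.float32)
B = np.array([[4, 1, 0], [9, 1, 0], [10, 1, 0]], dtype=np.float32)
print(batch_query_sketch(A, B, [6, 1], [1, 2], radius=2.0))
# Rows match the matrix above, except the last row comes out as [1, 2]
# here because this sketch keeps index order instead of distance order.
```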
Hi @HuguesTHOMAS ,
Wow, that's cool! Many thanks for your amazing explanation, it made me completely understand what the function does!
I will try my best to get more excellent results, and I will share them with you :)
Thanks a lot for your patient and professional reply!