Cylinder3D
Any ideas on how to make inference faster?
Thanks a lot for sharing this awesome code!
I'm really looking forward to doing some experiments with this great model, but right now I'm struggling with the run time.
In my environment with an RTX 3090, the forward pass alone takes 80-90 ms.
I want to keep the entire process (pre-processing + forward pass + post-processing) within 100 ms, which is the sampling period of my LiDAR. (Under 80 ms would be better, to leave some margin.)
I can try things such as reducing the range and resolution of the point clouds to speed up the pre-/post-processing, for example as sketched below.
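A minimal sketch of what I mean by range cropping, assuming an N x 4 numpy array of (x, y, z, intensity), which is the SemanticKITTI scan layout:

```python
import numpy as np

def crop_by_range(points: np.ndarray, max_range: float = 50.0) -> np.ndarray:
    """Keep only points within max_range meters of the sensor."""
    dists = np.linalg.norm(points[:, :3], axis=1)
    return points[dists < max_range]

# Example:
# points = np.fromfile("scan.bin", dtype=np.float32).reshape(-1, 4)
# points = crop_by_range(points, max_range=50.0)
```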
But I'm not sure how to speed up the model's forward pass itself.
Can you give me some advice? Thanks!
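For reference, here is roughly how I measured the forward time. It's a minimal sketch; `model` and `batch` stand in for the actual Cylinder3D model and inputs. CUDA kernels launch asynchronously, so `torch.cuda.synchronize()` is needed before reading the clock, and a few warm-up iterations avoid measuring one-time setup costs:

```python
import time
import torch

@torch.no_grad()
def time_forward(model, batch, warmup: int = 5, iters: int = 20) -> float:
    model.eval()
    for _ in range(warmup):       # warm-up: cuDNN autotune, memory caching
        model(*batch)
    torch.cuda.synchronize()      # wait for all queued kernels to finish
    start = time.perf_counter()
    for _ in range(iters):
        model(*batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0  # ms per pass
```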
Hello @CCodie,
Can you share some details about how you managed to run this on an RTX 3090? Library versions, CUDA version, and so on. And did you make any changes to the code? I am trying to run this on an RTX 3060 Ti, but I cannot get it to run with CUDA 10.2, and changing the library versions seems to have been an issue for many people.
To make the forward pass faster, I guess you could try reducing the size of the network and retraining it, then see whether you can keep similar performance while running faster. A rough sketch of what that could look like follows.
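This is only a hedged sketch: it assumes the config exposes a base channel width such as Cylinder3D's `init_size` and the cylindrical voxel grid `output_shape`; check your config file for the exact keys. Halving the width roughly quarters the convolution FLOPs, but the smaller model must be retrained:

```python
import yaml

# assumed path and keys, based on the repo's config layout
with open("config/semantickitti.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model_params"]["init_size"] = 16                 # e.g. 32 -> 16 base channels
cfg["model_params"]["output_shape"] = [240, 180, 32]  # coarser cylindrical grid

with open("config/semantickitti_small.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```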
Hello @mpQuintana, actually I wanted to ask whether there are any configuration options that easily reduce the size of the network, but thanks for your advice!
I'm developing on Ubuntu 20.04 / RTX 3090 / CUDA 11.1 with compatible versions of PyTorch and torch-scatter. I remember that getting the proper version of the spconv library was quite tricky: I tried spconv 2.x.x and changed some code to make it run, but it gave me totally different results. I solved that problem by just using spconv 1.2.1, which is the exact version the author used.
@CCodie unfortunately, spconv v1.x is quite slow and deprecated. I have forked the project to implement spconv v2.1.x support (not tested yet with v2.2.x). At home I am getting a 2-3x speedup with an RTX 3060, and it should be even faster with spconv v2.2.x.
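For anyone attempting the same migration, the main source-level change is the import path; the layer names (`SubMConv3d`, `SparseConv3d`, `SparseSequential`, `SparseConvTensor`) are unchanged. Also note that, as far as I know, v1.x checkpoints are not directly loadable in v2.x because the convolution weight layout changed, which may explain the "totally different results" mentioned above. A minimal sketch with random inputs, just to show the API:

```python
# spconv v1.x:
#   import spconv
# spconv v2.x:
import spconv.pytorch as spconv
import torch

conv = spconv.SparseSequential(
    spconv.SubMConv3d(16, 32, kernel_size=3, indice_key="subm0"),
).cuda()

features = torch.randn(100, 16).cuda()                   # N x C point features
indices = torch.randint(0, 32, (100, 4),                 # columns: batch, z, y, x
                        dtype=torch.int32).cuda()
x = spconv.SparseConvTensor(features, indices,
                            spatial_shape=[32, 32, 32], batch_size=32)
out = conv(x)
```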
This repo also converts the tensor types inside the training loop (not in the dataloader) and computes validation on the CPU (without batch support). Speedups are possible there too.
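For instance, a sketch of moving the numpy-to-tensor conversion into the dataset, so that DataLoader workers do it in parallel instead of the training process paying for it serially. The class and variable names here are illustrative, not this repo's actual API:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ScanDataset(Dataset):
    def __init__(self, scans):
        self.scans = scans            # list of numpy arrays, one per scan

    def __len__(self):
        return len(self.scans)

    def __getitem__(self, i):
        # convert here, in a worker process, not in the training loop
        return torch.from_numpy(self.scans[i]).float()

loader = DataLoader(ScanDataset([np.zeros((1000, 4), np.float32)]),
                    batch_size=1, num_workers=2, pin_memory=True)
for batch in loader:
    batch = batch.cuda(non_blocking=True)  # async copy thanks to pin_memory
```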