gsplat

nd rasterizer is 10x slower than rasterizer

Open zubair-irshad opened this issue 2 years ago • 11 comments

Hi, great work! The nd rasterizer is around 10x slower than the sh rasterizer. To be precise, my model's inference time with sh rasterization is 0.008 s, which gives me >100 FPS as described in the original Gaussian splatting paper, but just adding nd rasterization increases it to 0.075 s, i.e. 13 FPS.

Is there a way to make it better? Any intuition would be greatly appreciated. With nd rasterization, it looks like we lose the benefits i.e. speed of gaussian splatting. Thank you again for the awesome work!

zubair-irshad avatar Nov 02 '23 22:11 zubair-irshad

Hi! What N are you using? In rasterization, each pixel requires an N-d array of workspace memory. For RGB, we can fit that in register memory, and can specify this statically at compile time. We wrote the N-d rasterizer for the case where the necessary workspace exceeds available register memory and must live in global memory. This means we can't make the same kinds of optimizations there as in the RGB rasterizer. If this is the case for you, then you can either stick with the global memory version, or you can rasterize in batches with the current optimized RGB rasterizer (channels 0-3, 3-6, etc). We're considering adding an in-between version of the rasterizer for MAX_REGISTER_CHANNELS=16 with similar optimizations to the RGB rasterizer.
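A minimal sketch of the batching idea above, in plain Python. It is not gsplat's API: `rasterize_rgb` is a hypothetical callable standing in for whatever fixed-width rasterizer is available, and the list-of-rows data layout is an assumption chosen only to keep the example self-contained. It splits the N feature channels into ranges 0-3, 3-6, ..., zero-pads the last range if N is not a multiple of the batch size, and concatenates the per-pixel outputs.

```python
from typing import Callable, List, Sequence, Tuple

Rows = List[List[float]]


def channel_batches(num_channels: int, batch_size: int = 3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) channel ranges: (0, 3), (3, 6), ..."""
    if num_channels <= 0 or batch_size <= 0:
        raise ValueError("num_channels and batch_size must be positive")
    return [
        (start, min(start + batch_size, num_channels))
        for start in range(0, num_channels, batch_size)
    ]


def rasterize_in_batches(
    features: Sequence[Sequence[float]],
    rasterize_rgb: Callable[[Rows], Rows],  # hypothetical fixed-width rasterizer
    batch_size: int = 3,
) -> Rows:
    """Render N-channel per-Gaussian features with a batch_size-channel rasterizer.

    features: one row per Gaussian, each row holding N channel values.
    Returns one row per pixel, each row holding N rendered channel values.
    """
    num_channels = len(features[0])
    pixels: Rows = []
    for start, end in channel_batches(num_channels, batch_size):
        width = end - start
        pad = [0.0] * (batch_size - width)  # zero-pad a short final batch
        chunk = [list(row[start:end]) + pad for row in features]
        rendered = [list(px[:width]) for px in rasterize_rgb(chunk)]  # drop padding
        if not pixels:
            pixels = rendered
        else:
            for out_px, new_px in zip(pixels, rendered):
                out_px.extend(new_px)
    return pixels


def fake_rasterizer(chunk: Rows) -> Rows:
    """Stand-in for a real rasterizer: one pixel, averaging all Gaussians."""
    return [[sum(col) / len(chunk) for col in zip(*chunk)]]
```

Note that every batch in this sketch is a full rasterization pass, so the total cost grows with the number of batches (10 passes for N=30 with 3 channels per pass).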

vye16 avatar Nov 02 '23 23:11 vye16

Thank you for the great intuition and detailed response. My channel size is currently 29, but I am considering increasing the feature size to 128 or even 256, and my worry is that it will then be even slower than 13 FPS. I will try the batched RGB rasterizer in a for loop, as you suggested, and see if it gives a higher FPS. Thank you!

zubair-irshad avatar Nov 02 '23 23:11 zubair-irshad

@vye16 Reporting back what I found. Implementing a for loop to rasterize multiple channels in batches (i.e. 0-3, 3-6, etc.) instead of using ND rasterization is slightly worse in performance, not better. My guess is that this is due to the for loop, which has to run 10 times for the channel size I am trying, i.e. 30. Any other intuition to improve performance is greatly appreciated, thank you!

Just to provide more specifics: the per-iteration time for a 640 by 480 image is ~74-76 ms for nd rasterization with N=30, 82-85 ms for the batched version (the one I described above), and 16 ms for plain sh, i.e. 3-channel, rendering. The same results translate to FPS numbers during inference, i.e. 13 FPS for nd_rasterization with N=30 vs >100 FPS for sh rasterization only.

zubair-irshad avatar Nov 03 '23 03:11 zubair-irshad

Update: with the batched implementation, FPS increased to 25, though that is still far below the >100 FPS of the original rasterizer implementation.

zubair-irshad avatar Nov 03 '23 04:11 zubair-irshad

@vye16 @maturk Any plans on supporting larger register channels, i.e. MAX_REGISTER_CHANNELS>3, perhaps 16 or 32, to achieve the same level of optimization that the native sh rasterizer gives? I am happy to create a PR. However, just increasing this number gives errors elsewhere, for instance AT_ERROR("v_colors must have dimensions (N, 3)"); Should I change anything else in the CUDA code to achieve this?

I am also wondering whether there are any downsides to specifying a MAX_REGISTER_CHANNELS of 128, 256 or 512. Would it affect memory usage? I think GPUs with more memory could support this? Any intuition is greatly appreciated.

zubair-irshad avatar Nov 06 '23 21:11 zubair-irshad

Hi Zubair, sorry for the late response. Currently the color rasterization represents color in float3 (CUDA vectorized type). We can make a version that accepts N-d colors up to ~32 channels that could fit in shared memory during rasterization. Unfortunately 128, 256, 512 would be too big to fit in shared memory in one pass, but it is possible to rasterize them in batches of channels (0-32) that fit in shared memory. This is unlikely to reach similar performance, but would be better than the current ND-rasterizer. In the near-term we're not currently working on it, but I'm happy to guide you if you'd like to make a PR.
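A rough back-of-the-envelope check of the sizes mentioned above, as a Python sketch. The numbers in it are assumptions, not gsplat's actual kernel configuration: it assumes a 16x16 thread block with one float32 per channel per pixel held in shared memory, and a 48 KB per-block shared memory budget (a common default limit in CUDA; the real limit depends on the GPU and launch configuration).

```python
import math

BYTES_PER_FLOAT32 = 4
ASSUMED_THREADS_PER_BLOCK = 16 * 16      # assumption: 16x16 pixel tile per block
ASSUMED_SHARED_BUDGET = 48 * 1024        # assumption: 48 KB shared memory per block


def workspace_bytes(channels: int, threads: int = ASSUMED_THREADS_PER_BLOCK) -> int:
    """Bytes needed if every pixel in the tile keeps `channels` float32 values."""
    return channels * threads * BYTES_PER_FLOAT32


def fits_in_shared(channels: int, budget: int = ASSUMED_SHARED_BUDGET) -> bool:
    """True if the per-block workspace fits in the assumed shared memory budget."""
    return workspace_bytes(channels) <= budget


def passes_needed(channels: int, channels_per_pass: int = 32) -> int:
    """Number of channel batches (0-32, 32-64, ...) needed to cover all channels."""
    return math.ceil(channels / channels_per_pass)


if __name__ == "__main__":
    for c in (3, 32, 128, 256, 512):
        kib = workspace_bytes(c) / 1024
        print(f"{c:>3} channels: {kib:7.1f} KiB, fits={fits_in_shared(c)}, "
              f"passes of 32: {passes_needed(c)}")
```

Under these assumptions, 32 channels need 32 KiB per block and fit, while 128 channels would need 128 KiB and do not, so they would have to be split into 4 passes of 32 channels each.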

vye16 avatar Dec 06 '23 15:12 vye16

Thanks @vye16! I am happy to work on it and make a PR. Any pointers on where I start/which parts I look at changing first would be appreciated, thanks a lot!

zubair-irshad avatar Dec 08 '23 09:12 zubair-irshad

Any update on this issue? I am also working on rendering high-dimensional features, and want to know how to speed up the nd rasterizer.

SeanGuo063 avatar Dec 21 '23 12:12 SeanGuo063

#130 works towards this issue, let me know if you try it out! @zubair-irshad

kerrj avatar Feb 14 '24 17:02 kerrj

This is great, I will check it asap. Thanks @kerrj.

zubair-irshad avatar Feb 15 '24 02:02 zubair-irshad

@zubair-irshad Hi! Great question. Are there any updates on the CHANNEL > 3 case? I also want to implement something with a much larger channel size. What's your current solution to this problem? :)

yiduohao avatar Sep 23 '24 21:09 yiduohao