Falcor CSM shadow performance issues

I was curious about the poor CSM shadow performance, so I recently profiled it and found what are probably a few bottlenecks. I haven't implemented any fixes yet, but I'd like to bring attention to them and get the conversation going.

Vulkan

In Vulkan release builds, with ForwardRenderer using default settings, the shadow pass takes ~10ms on a 2080Ti. Most of that time (~8ms) is spent in the getCascadeParamsAndCheckForBlend() function in CascadedShadowMap.slang. I manually inlined the getCascadeIndex() function and it helped a bit. Then I realized the CsmData struct is probably too large. Changing CSM_MAX_CASCADES from 8 to 4 brings the time down to 4.6ms.
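To illustrate how the struct footprint scales with CSM_MAX_CASCADES, here is a minimal sketch. The layout below is hypothetical and is not the real CsmData; Float4, Float4x4, CsmDataSketch, csmDataBytes() and selectCascade() are names made up for this example. It only models a struct that carries a few per-cascade arrays, which is the pattern that makes halving the cascade limit shrink the data the shader has to hold.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the engine's vector and matrix types.
struct Float4 { float x, y, z, w; };
struct Float4x4 { Float4 rows[4]; };

// Hypothetical layout, NOT the real CsmData: one global matrix plus
// three per-cascade arrays sized by the compile-time cascade limit.
template <std::size_t MaxCascades>
struct CsmDataSketch {
    Float4x4 globalMat;
    Float4 cascadeScale[MaxCascades];
    Float4 cascadeOffset[MaxCascades];
    Float4 cascadeRange[MaxCascades];
    float depthBias;
    int cascadeCount;
    float padding[2];
};

// Size in bytes of the sketch struct for a given cascade limit.
template <std::size_t MaxCascades>
constexpr std::size_t csmDataBytes() {
    return sizeof(CsmDataSketch<MaxCascades>);
}

// Picks the first cascade whose far depth lies beyond viewDepth.
// The loop is bounded by the runtime cascade count, not the maximum.
inline int selectCascade(const float* cascadeFarDepths, int cascadeCount,
                         float viewDepth) {
    assert(cascadeCount > 0);
    for (int i = 0; i < cascadeCount - 1; ++i) {
        if (viewDepth < cascadeFarDepths[i]) return i;
    }
    return cascadeCount - 1;
}
```

With this made-up layout, csmDataBytes<8>() is 464 bytes and csmDataBytes<4>() is 272 bytes. The real numbers depend on the actual fields in CsmData, so treat these only as an indication of how the size grows with the cascade limit.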

D3D12

In D3D12, apart from the same register pressure problem seen in Vulkan, I also noticed that the generateMips() function has a huge cost on the GPU. I think it's because:

There are too many software blits going on: number of slices * number of mips, and the mip count is set to the maximum.
setGraphicsVars() in RenderContext.h always sets mBindGraphicsRootSig to true, and blit() pushes and pops graphics vars on every call, so each blit causes a bind root signature operation.
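The two points above can be sketched in a few lines. This is an illustration under stated assumptions, not Falcor code: fullMipCount(), blitCount() and ContextSketch are hypothetical names, and the sketch assumes one blit per generated mip level per slice (level 0 is the source, so the count is slices * (mips - 1), close to the slices * mips estimate above). ContextSketch shows one common way to filter redundant binds: only mark the root signature dirty when it actually changes.

```cpp
#include <cassert>
#include <cstdint>

// Number of levels in a full mip chain for a width x height texture.
inline std::uint32_t fullMipCount(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    std::uint32_t largest = width > height ? width : height;
    std::uint32_t levels = 1;
    while (largest > 1) {
        largest >>= 1;
        ++levels;
    }
    return levels;
}

// Assumes one blit per generated level (every level except mip 0) per slice.
inline std::uint32_t blitCount(std::uint32_t slices, std::uint32_t mips) {
    assert(mips > 0);
    return slices * (mips - 1);
}

// Hypothetical context that binds the root signature lazily and
// skips the bind when the same signature is set again.
class ContextSketch {
public:
    void setRootSignature(const void* rootSig) {
        if (rootSig != mBound) {  // only a real change marks the state dirty
            mBound = rootSig;
            mDirty = true;
        }
    }

    void draw() {
        if (mDirty) {  // stands in for the actual API bind call
            ++mBindCount;
            mDirty = false;
        }
        ++mDrawCount;
    }

    int bindCount() const { return mBindCount; }
    int drawCount() const { return mDrawCount; }

private:
    const void* mBound = nullptr;
    bool mDirty = false;
    int mBindCount = 0;
    int mDrawCount = 0;
};
```

For example, a 2048x2048 shadow map has a full chain of 12 mips, so 4 cascade slices would need 44 blits under the assumption above. With the dirty check, setting the same root signature before each of those blits results in a single bind instead of one per blit.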

Conclusion

We need to be frugal with the size of the CsmData struct and the number of mips required by some shadow map filtering techniques.
generateMips() and blit() in D3D12 need to be improved.
Falcor internal code should do a better job of reducing redundant state changes, both at the RHI level and in high-level code (e.g. use pushGraphicsVars() sparingly).

Apr 06 '19 04:04 philcn

Thanks for the detailed analysis. We're working on Falcor 4.0 which will have many optimizations to CB assignments. We also plan to leverage Slang's ParameterBlock to reduce the CPU overhead of initializing and binding ProgramVars.

We plan to have a beta version ready around SIGGRAPH (late July '19).

Apr 08 '19 19:04 nbentyNV

Thanks for the update.

From my experience using Falcor, CBs and redundant state changes are the two main sources of CPU overhead. I'm glad that you're working to make these improvements. Additionally, two things I'd like to see in Falcor are better multi-context support for parallel command encoding and a scene renderer that doesn't upload as many per-draw CBs (via instancing, bindless, etc.). Any chance either of those is planned for 4.0?

Finally, how much API change should I anticipate for 4.0, assuming I'm working at the scene renderer level?

Apr 09 '19 08:04 philcn

Yes, we plan to drastically reduce the CPU overhead. We're not going full-bindless, but the plan is to have a pre-generated descriptor set for Scene, which includes all the materials, mesh information, etc. Then the SceneRenderer only needs to set a single value into the root-signature per execute() call.

Once we do that, it should be simple to change the SceneRenderer to pre-record a command list.

I don't have anything on my plate for better multi-context support. AFAICT it works as expected. If you have any feedback or requests, please let me know.

Apr 17 '19 21:04 nbentyNV

I'm glad to know the plan for a low-overhead scene renderer; that's very relevant to the work I've been doing.

As for multi-context rendering in Falcor, I'll try it out and let you know if I run into any problems, but I think in the long term having a sample would be quite helpful.

Apr 17 '19 21:04 philcn
