Falcor
CSM shadow performance issues
I was curious about the poor CSM shadow performance, so I recently profiled it and found what are probably some performance bottlenecks. I haven't implemented any of these changes yet, but I'd like to bring attention to them and get the conversation going.
Vulkan
In Vulkan release builds, with ForwardRenderer at its default settings, the shadow pass takes ~10ms on a 2080Ti. Most of that time (~8ms) is spent in the getCascadeParamsAndCheckForBlend() function in CascadedShadowMap.slang. I manually inlined the getCascadeIndex() function and it helped a bit. Then I realized the CsmData struct is probably too large. Changing CSM_MAX_CASCADES from 8 to 4 brings the time down to 4.6ms.
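To make the struct-size point concrete, here is a minimal sketch of how a struct with fixed-size per-cascade arrays scales with the cascade limit. The layout below is hypothetical (the field names, the `Float4` and `Float4x4` stand-ins, and `CsmDataSketch` are all assumptions for illustration) and is not Falcor's actual CsmData definition. Whether the extra bytes turn into register pressure depends on how the shader compiler handles the struct.

```cpp
#include <cstddef>

// Hypothetical stand-ins for shader vector/matrix types.
struct Float4 { float x, y, z, w; };      // 16 bytes
struct Float4x4 { Float4 rows[4]; };      // 64 bytes

// Hypothetical layout, NOT Falcor's real CsmData.
// Per-cascade data lives in fixed-size arrays, so the struct grows
// linearly with MaxCascades even if fewer cascades are used at runtime.
template <std::size_t MaxCascades>
struct CsmDataSketch {
    Float4x4 globalMat;                   // shared light-space transform
    Float4 cascadeScale[MaxCascades];     // per-cascade scale
    Float4 cascadeOffset[MaxCascades];    // per-cascade offset
    Float4 cascadeRange[MaxCascades];     // per-cascade depth range (padded)
};

// 64 bytes shared + 3 arrays * 16 bytes per cascade.
static_assert(sizeof(CsmDataSketch<8>) == 64 + 3 * 16 * 8, "8 cascades");
static_assert(sizeof(CsmDataSketch<4>) == 64 + 3 * 16 * 4, "4 cascades");
```

In this sketch, halving the cascade limit shrinks the struct from 448 to 256 bytes. The shared part does not shrink, so the saving is less than half.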
D3D12
In D3D12, apart from the same register pressure problem seen in Vulkan, I also noticed that the generateMips() function has a huge cost on the GPU. I think it's because:
- There are too many software blits going on: number of slices * number of mips, with the mip count set to the maximum.
- setGraphicsVars() in RenderContext.h always sets mBindGraphicsRootSig to true, and blit() pushes and pops graphics vars every time, causing a bind root signature operation.
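As a rough illustration of how quickly the blit count grows, the sketch below computes the length of a full mip chain and the resulting number of blits. It assumes one blit per slice per generated level, so mip 0 (the rendered level) is not counted. The function names are hypothetical and not part of Falcor.

```cpp
#include <algorithm>
#include <cstdint>

// Length of a full mip chain: floor(log2(max(width, height))) + 1.
constexpr std::uint32_t fullMipCount(std::uint32_t width, std::uint32_t height) {
    std::uint32_t largest = std::max(width, height);
    std::uint32_t levels = 0;
    while (largest > 0) {   // count the bits needed to represent 'largest'
        ++levels;
        largest >>= 1;
    }
    return levels;
}

// Assumption: each level below mip 0 costs one blit per array slice.
constexpr std::uint32_t blitCount(std::uint32_t slices,
                                  std::uint32_t width,
                                  std::uint32_t height) {
    const std::uint32_t mips = fullMipCount(width, height);
    return mips == 0 ? 0 : slices * (mips - 1);
}
```

Under these assumptions, a 2048x2048 shadow map has 12 mip levels, so 4 slices need 44 blits per generateMips() call and 8 slices need 88. If every blit also rebinds the root signature, that cost is multiplied by the same factor.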
Conclusion
- We need to be frugal about the CsmData struct and the number of mips required by some shadow map filtering techniques.
- generateMips() and blit() in D3D12 need to be improved.
- Falcor internal code should do a better job of reducing redundant state changes, both at the RHI level and in high-level code (e.g. use pushGraphicsVars() sparingly).
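One common way to reduce redundant state changes is to remember the last bound state and skip the bind when nothing changed. The sketch below applies that idea to root-signature binding. `RootSignatureBindFilter` and `RootSignatureId` are hypothetical names for illustration only. This is not how Falcor's RenderContext is implemented, and a real version would also have to invalidate the cached state wherever the underlying command list is reset.

```cpp
#include <cstdint>

// Hypothetical handle identifying a root signature.
using RootSignatureId = std::uint32_t;

// Tracks the last bound root signature and reports whether a bind is needed.
class RootSignatureBindFilter {
public:
    // Returns true when the caller must actually issue the bind.
    bool needsBind(RootSignatureId requested) {
        if (mHasBound && mBound == requested) {
            return false;               // same signature: skip the redundant bind
        }
        mBound = requested;
        mHasBound = true;
        ++mBindCount;
        return true;
    }

    // Forget the cached state, e.g. after a command list reset.
    void invalidate() { mHasBound = false; }

    // Number of binds that were actually required so far.
    std::uint32_t bindCount() const { return mBindCount; }

private:
    RootSignatureId mBound = 0;
    bool mHasBound = false;
    std::uint32_t mBindCount = 0;
};
```

With a filter like this, a loop of blits that all use the same root signature would bind once instead of once per blit, regardless of how often graphics vars are pushed and popped.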
Thanks for the detailed analysis.
We're working on Falcor 4.0, which will have many optimizations to CB assignments. We also plan to leverage Slang's ParameterBlock to reduce the CPU overhead of initializing and binding ProgramVars.
We plan to have a beta version ready around SIGGRAPH (late July '19).
Thanks for the update.
From my experience using Falcor, CBs and redundant state changes are the two main sources of CPU overhead. I'm glad that you're working on these improvements. Additionally, two things I'd like to see in Falcor are better multi-context support for parallel command encoding and a scene renderer that doesn't upload as many per-draw CBs (via instancing, bindless, etc.). Any chance either of those is planned for 4.0?
Finally, how much API change should I anticipate for 4.0, assuming I'm working at the scene renderer level?
Yes, we plan to drastically reduce the CPU overhead.
We're not going full-bindless, but the plan is to have a pre-generated descriptor set for Scene, which includes all the materials, mesh information, etc. Then the SceneRenderer only needs to set a single value into the root-signature per execute() call.
Once we do that, it should be simple to change the SceneRenderer to pre-record a command list.
I don't have anything on my plate for better multi-context support. AFAICT it works as expected. If you have any feedback or requests, please let me know.
I'm glad to know the plan for a low-overhead scene renderer; that's very relevant to the work I've been doing.
As for multi-context rendering in Falcor, I'll try it out and let you know if I run into any problems, but I think in the long term having a sample would be quite helpful.