WIP: Material system: Implement GPU occlusion culling
Some steps to implement occlusion culling in compute shaders using a Hierarchical Depth Buffer (Hi-Z).
The depth buffer is processed before the post-processing pass, reducing it mip by mip until a 1x1 resolution is reached. For bounding spheres that pass the frustum test, their AABB is projected into screen space and tested against the Hi-Z buffer.
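As a reference for the reduction step described above, here's a CPU sketch of one Hi-Z mip reduction (the real work happens in a compute shader; `downsampleMax` is a hypothetical name, not the engine's). It takes the max (farthest) depth of each 2x2 footprint, and lets the last row/column of destination texels absorb the leftover source texels when a dimension is odd, so no depth samples are dropped and the result stays conservative:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One Hi-Z reduction step. Each destination texel takes the MAX (farthest)
// depth of the source texels it covers, so the occlusion test "object's
// nearest depth is farther than the stored depth" never culls a visible
// object. With odd source dimensions, the last row/column extends its
// footprint to cover the extra source texels.
std::vector<float> downsampleMax( const std::vector<float>& src, int w, int h )
{
	int dw = std::max( w / 2, 1 );
	int dh = std::max( h / 2, 1 );
	std::vector<float> dst( dw * dh, 0.0f );

	for ( int y = 0; y < dh; y++ )
	{
		// Last destination row absorbs the leftover source row when h is odd.
		int syEnd = ( y == dh - 1 ) ? h - 1 : 2 * y + 1;

		for ( int x = 0; x < dw; x++ )
		{
			int sxEnd = ( x == dw - 1 ) ? w - 1 : 2 * x + 1;
			float m = 0.0f;

			for ( int sy = 2 * y; sy <= syEnd; sy++ )
				for ( int sx = 2 * x; sx <= sxEnd; sx++ )
					m = std::max( m, src[sy * w + sx] );

			dst[y * dw + x] = m;
		}
	}

	return dst;
}
```

Repeating this until the mip is 1x1 yields the full Hi-Z chain; the even/odd handling here is exactly the part that is easy to get wrong.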
(sphere projection is broken right now, hence why it's a draft; also no testing needed until that is fixed)
The sphere projection function is giving incorrect results; I'm fairly sure the world→view-space conversion is correct, so the error is likely in the projection itself. I'm also not seeing much of a difference in frame time between this change and frustum-only culling: the frame time won by issuing fewer draws is nullified by the Hi-Z pass. It could be improved a bit by doing the depth-buffer downsampling in a single compute dispatch, but it might just not be worth it. On the other hand, the Hi-Z buffer could be used for some other effects too.
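For reference, a conservative way to get the screen-space AABB described above is to project the eight corners of the sphere's view-space AABB (center ± radius) and take the min/max of the results. This is a hypothetical CPU sketch, assuming a right-handed view space looking down -Z and a symmetric perspective projection with focal lengths `fx`/`fy`; it is not the engine's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct ScreenAABB { float minX, maxX, minY, maxY; };

// Conservative NDC-space AABB of a bounding sphere with view-space center
// (cx, cy, cz) and radius r. Projects all 8 corners of the view-space AABB
// and keeps the min/max; this is looser than an exact sphere projection,
// but never smaller, which is what the occlusion test needs.
ScreenAABB projectSphereAABB( float cx, float cy, float cz,
                              float r, float fx, float fy )
{
	ScreenAABB box{ 1e30f, -1e30f, 1e30f, -1e30f };

	for ( int i = 0; i < 8; i++ )
	{
		float x = cx + ( ( i & 1 ) ? r : -r );
		float y = cy + ( ( i & 2 ) ? r : -r );
		float z = cz + ( ( i & 4 ) ? r : -r );

		// Assumes the sphere is fully in front of the near plane (z < 0);
		// spheres crossing the near plane must be treated as visible instead.
		float ndcX = fx * x / -z;
		float ndcY = fy * y / -z;

		box.minX = std::min( box.minX, ndcX );
		box.maxX = std::max( box.maxX, ndcX );
		box.minY = std::min( box.minY, ndcY );
		box.maxY = std::max( box.maxY, ndcY );
	}

	return box;
}
```

A common pitfall with this kind of function (and a plausible source of incorrect results) is projecting the center first and adding ± r after the perspective divide, which is not conservative.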
Well, I'm seeing some improvements in certain cases, especially with high graphics settings. The whole thing also isn't taking too much frame time in itself anyway (the indirect draws + cull compute).
So the lower framerate that illwieckz was getting in #1137 might have to do with something else. Ideally we could also use better synchronization, but I'm not sure that would be possible.
Possibly rasterizing the bounding boxes with colour/depth writes disabled to determine object visibility would be faster. The depth pre-pass is super fast everywhere I've looked at it here, so this should be fast too. It does, however, mean that object visibility would lag behind by a frame, but that's already happening with this Hi-Z algorithm.
It would also require a bunch of writes to a buffer from every rasterized fragment that passed the depth test, which is not ideal. Perhaps a custom rasterizer in a compute shader that instead writes the visible object IDs once per triangle, or per batch of triangles, would be better.
There are a few more things that could be improved independent of occlusion culling that might make this work faster in every case anyway:
- Use batches of e.g. 64 triangles instead of specifying surfaces, which would allow dropping the `processSurfaces_cp` shader and doing the writes in the `cull_cp` shader instead.
- That would also allow creating an index buffer instead of an indirect buffer, for the most part. There would still be one draw per material, but that draw would contain all of that material's triangles (that weren't culled).
- With triangle batches, culling at the batch level might speed things up even more, e.g. with cone culling.
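As a sketch of the cone-culling idea mentioned above (names and data layout are illustrative, not from the engine): each triangle batch stores a normal cone (axis plus the cosine of its half-angle) and an apex, and the whole batch can be skipped when the camera sits in the region where every triangle in it is backfacing:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 normalize( Vec3 v )
{
	float l = std::sqrt( v.x * v.x + v.y * v.y + v.z * v.z );
	return { v.x / l, v.y / l, v.z / l };
}

static float dot( Vec3 a, Vec3 b )
{
	return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns true if the whole batch can be culled, i.e. the view direction
// towards the cone apex lies inside the cone of backfacing directions.
// (apex, axis, cutoff) would be precomputed per batch at build time.
bool coneCull( Vec3 apex, Vec3 axis, float cutoff, Vec3 cameraPos )
{
	Vec3 view = normalize( { apex.x - cameraPos.x,
	                         apex.y - cameraPos.y,
	                         apex.z - cameraPos.z } );
	return dot( view, axis ) >= cutoff;
}
```

This rejects batches before rasterization without touching any per-triangle data, which is why it pairs well with fixed-size triangle batches.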
Another potential advantage of a compute rasterizer is assigning light sources to a tile/cluster buffer, though with low light counts it might not make much of a difference.
Doing hardware rasterization of BBoxes and writing to a buffer from the fragment shader with an early discard doesn't work well for some reason.
I believe this fixes the incorrect projection. However, there is still some incorrect culling, which might be due to incorrect depth downsampling with even/odd dimensions; I haven't checked yet.
This mostly works now; however, there's still an error with the AABB depth projection, and some things need to be un-hardcoded.
Added some comments and made r_lockPVS 1 work with occlusion culling.
Rebased.
Cleaned up a bit, and added a toggle through r_gpuOcclusionCulling.
Lol I wanted to approve https://github.com/DaemonEngine/Daemon/pull/1137 instead… 🤦♀️️
This works by creating a depth-reduction image from the depth buffer, then using it on the next frame.
How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.
> How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.
That's a mistake in the PR description; what is meant is that the results of the culling are used on the next frame, i.e. double-buffered (the depth buffer used is the current frame's).
Fixed the PR description.
I also noticed some new GLSL files use spaces or seem to mix tabs and spaces; I would prefer tabs.
Ah, yep, that should be fixed now.
I've added a portal readback as well: instead of just calling R_MirrorViewBySurface() for every portal surface after sorting them by distance each frame, the cull_cp shader now culls portal surfaces in addition to regular ones; the CPU then reads the results back next frame and uses them to determine which portals might be visible.
Adding R_SyncRenderThread() to GenerateWorldMaterials() also seems to have fixed crashes with r_smp 1, so I have disabled that restriction.
Could we get a comment somewhere with an overview of how the double buffering works and what its benefit is? Like, is there an async API for the compute shaders? If world surfaces are double-buffered but models are not, how do we keep them in sync when rendering?
There's this graph from #1137 which holds true here as well:
The benefit is not introducing stalls with extra synchronisation, so the compute shaders can keep executing alongside the regular pipeline (including swapBuffers()). Compute shaders are asynchronous by nature, so nothing extra needs to be done for that.
Double-buffering only applies to which surfaces are rendered, not how they are rendered. And since world and model surfaces are culled in completely different ways, synchronisation between them does not make sense.
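A minimal sketch of the double-buffered indexing, assuming the scheme described above (hypothetical names, not the engine's code): the cull results written by the GPU during frame N are only consumed during frame N + 1, so neither side ever waits on the other:

```cpp
#include <cassert>

// Two result buffers indexed by frame parity: the cull shader for frame N
// writes into buffers[N & 1], while the draws for frame N consume the
// results produced during frame N - 1 from buffers[(N + 1) & 1]. No fence
// between cull and draw is needed, at the cost of visibility lagging one
// frame behind the camera.
struct VisibilityBuffers
{
	int writeIndex( unsigned frame ) const { return frame & 1; }
	int readIndex( unsigned frame ) const { return ( frame + 1 ) & 1; }
};
```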
LGTM, but I would still like to understand the double buffering thing. I guess I never really understood the answer to this previous conversation:
> How can data generated on the previous frame be valid? Everything could be in a different position or viewed from a different angle.

> That's a mistake in pr description, what is meant is that the results of the culling are used on the next frame, i. e. double-buffered (the depth buffer is used from the current frame).
I don't get how surfaces can be double-buffered without also buffering everything else that may be rendered (particle effects, 2D drawing, etc.). If you used the surface visibility from an earlier time, the player could have moved to a point that makes some of them newly visible.
(rebased)
> If you used the surface visibility from an earlier time then the player could have moved to a point that makes some of them newly visible.
Technically yes, but in practice the fact that it's only one frame of difference makes that more or less impossible to notice, even when moving around fast. So far I haven't seen any discrepancies when using the engine with this change.
It's also helped by the fact that the cull uses the bounding spheres of the surfaces (either directly or by projecting an AABB from them), so the culling is somewhat coarse.