godot
godot copied to clipboard
Optimize RenderForwardClustered::_setup_render_pass_uniform_set by reducing Vector allocations
Optimize RenderForwardClustered::_setup_render_pass_uniform_set by reducing Vector allocations
This PR optimizes the RenderForwardClustered::_setup_render_pass_uniform_set function by reducing unnecessary vector reallocations.
Previously, the function heavily relied on Vector::push_back to populate the uniforms vector, consuming 1.47% of CPU time according to Hotspot profiling. This PR addresses the issue by:
- Replaced Vector with
thread_local LocalVector<..>. It allows us to allocate memory once and then re-use it without additional allocations. - Replaced
get_uniformsfunction withappend_uniforms. New function is directly adding values to the collection instead of copying them between collections.
These changes significantly reduce the CPU overhead, bringing down the time spent in LocalVector::clear() to 0.148%. get_uniforms was also removed and reduced CPU from 0.668% to 0.03% in append_uniforms.
Attached are Hotspot screenshots demonstrating the performance improvement:
Before
After
Testing
Created empty 3D scene with Vsync turned off. Run in DEBUG mode without any logic and any object.
Such a change, despite indeed making performance better, makes the code much less maintenable due to the hardcoded size and indices. The right way of addressing this would be the same in, for instance, https://github.com/godotengine/godot/blob/97b8ad1af0f2b4a216f6f1263bef4fbc69e56c7b/servers/rendering/rendering_device_graph.cpp#L1838-L1848.
Also, since I'm pretty ceratin there will be more cases like this, I'd suggest fixing a bunch (or all?) of them in a single PR.
The right way of addressing this would be the same in, for instance,
I know nothing about it but should it actually be cleared and constantly resized if the size in known? Resizing and direct assignment will be much faster, won't it?
The right way of addressing this would be the same in, for instance,
I know nothing about it but should it actually be cleared and constantly resized if the size in known? Resizing and direct assignment will be much faster, won't it?
LocalVector has the notion of capacity so it can be cleared without actually resizing the allocation. Vector doesn't provide such a feature, but it should be possible to change the APIs taking a Vector so they take a LocalVector instead. That would be the case of get_cache_vec() here. As long as it's not exposed, that should be safe. Also, LocalVector is more performant than Vector in that it doesn't have the overhead of copy-on-write semantics.
Amended to resolve the requested style fixes.
Thanks!