Core: Add `Node::iter_children` as a fast way to iterate a node's children
- Follow-up of https://github.com/godotengine/godot/pull/106224
This adds a performant way to iterate a node's children. This was discussed on RocketChat.
Currently, you either loop over `get_child_count()` and call `get_child()` for each index, which adds a lot of per-call overhead, or you iterate `get_children()`, which allocates a whole new array. Both are very slow.
`ChildrenIterator` is technically not needed (a raw `Node **` could be iterated instead), but it is added to hide the nature of the children storage from the caller. In the future, we may want to change the storage without needing to change caller code.
Additionally, I'm adding an `Iterable` template, so that children are iterated via `node.iter_children()` rather than by iterating `node` directly. If `node` itself were iterable, it would not be immediately clear that the children are what is being iterated. It would also force the validity checks to run twice (once for `begin()` and once for `end()`), making it unnecessarily slow.
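As a rough illustration, here is a minimal, self-contained sketch of the shape this takes. The names and the `std::vector` storage are stand-ins for illustration only, not the actual PR code (the engine keeps children in its own cache):

```cpp
#include <cstdio>
#include <vector>

struct Node;

// Wraps the raw Node ** so callers don't depend on how children are stored.
class ChildrenIterator {
public:
    explicit ChildrenIterator(Node **p_ptr) : ptr(p_ptr) {}
    Node *operator*() const { return *ptr; }
    ChildrenIterator &operator++() { ++ptr; return *this; }
    bool operator!=(const ChildrenIterator &p_other) const { return ptr != p_other.ptr; }

private:
    Node **ptr = nullptr;
};

// Small begin()/end() holder so range-for runs on node.iter_children()
// instead of on the node itself, and any validity checks run only once.
template <typename TIterator>
class Iterable {
public:
    Iterable(TIterator p_begin, TIterator p_end) : _begin(p_begin), _end(p_end) {}
    TIterator begin() const { return _begin; }
    TIterator end() const { return _end; }

private:
    TIterator _begin;
    TIterator _end;
};

struct Node {
    std::vector<Node *> children; // stand-in for the engine's child cache

    Iterable<ChildrenIterator> iter_children() {
        Node **base = children.data();
        return Iterable<ChildrenIterator>(ChildrenIterator(base), ChildrenIterator(base + children.size()));
    }
};

int main() {
    Node parent, a, b;
    parent.children = { &a, &b };

    // No array copy and no per-index accessor call per child.
    for (Node *child : parent.iter_children()) {
        std::printf("child: %p\n", static_cast<void *>(child));
    }
    return 0;
}
```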
Benchmark
- Debug: before 32-35 fps, after 23-24 fps
- Release: before 55-58 fps, after 46-50 fps
We did suspect this might be slower in debug, but it also seems slower in release than the previous version (which accessed the child cache directly via `Span`). I'll try to pin down what is slowing it down.
Here's a profile. Notably, I'm finding that a lot of small functions aren't getting inlined; it turns out this is likely because of the thread guards, which I created an issue for: #107393.
The first obvious thing is that it seems to be branching on `p_internal`. This could be due to the function not being inlined. It might be worth experimenting with putting this branch in the header and, if necessary, calling two versions of the function (in the cpp if desired), as we should be able to get this to resolve at compile time.
Thank you for profiling! Maybe I'm just not used to it, but the profile results are curious to me. It seems like most of the lost performance is scattered around the function in various spots. I would definitely have expected the thread guard to be by far the most expensive thing in this function, at least in debug. The only other cause I can imagine is cache misses on the accessed data, but that shouldn't affect the fps much, since it's accessing the same data as before.
Anyway, I agree that `p_include_internal` doesn't need to be a runtime variable. I made it a template parameter instead, and also eliminated the 0 branch, which wasn't needed. I wouldn't expect this to make a big difference normally, but perhaps it adds up in total?
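As a rough, standalone sketch of what that change amounts to (hypothetical names, plain std types standing in for the engine's child cache): the include-internal flag becomes a template parameter, so the branch resolves at compile time instead of running on every call.

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

struct Node {
    std::vector<Node *> children; // stand-in for the child cache
    size_t internal_front = 0;    // internal children kept at the front
    size_t internal_back = 0;     // internal children kept at the back

    // The flag is a template parameter, so `if constexpr` resolves the
    // branch at compile time and the excluded path generates no code.
    template <bool include_internal>
    std::pair<Node **, Node **> child_range() {
        Node **begin = children.data();
        Node **end = begin + children.size();
        if constexpr (!include_internal) {
            begin += internal_front;
            end -= internal_back;
        }
        return { begin, end };
    }
};

int main() {
    Node n, a, b, c;
    n.children = { &a, &b, &c };
    n.internal_front = 1;

    auto [begin, end] = n.child_range<false>(); // the internal child is skipped
    std::printf("non-internal children: %zu\n", static_cast<size_t>(end - begin));
    return 0;
}
```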
> Maybe I'm just not used to it, but the profile results are curious to me.
It's definitely worth repeating on your end (maybe with different benchmarks and loops). I do a lot of profiling, but not on master, and the SCons options are not super familiar to me, so it's possible I've messed something up in the build. It does seem to be inlining some functions (confirming it's a release build), but it's not doing a very good job with this one.
It seems most of the slowdown is due to the thread guards. We have discussed removing them for this high-performance iterator, and I'll check the performance against current master.
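For context, here is a purely conceptual sketch of the kind of per-call cost being discussed. This is not Godot's actual guard macro, just an illustration of a debug-style thread check adding an extra load, compare, and branch to an otherwise trivial accessor, which also makes it harder to inline:

```cpp
#include <cassert>
#include <thread>
#include <vector>

struct Node {
    std::vector<Node *> children;
    std::thread::id owner_thread = std::this_thread::get_id();

    // Guarded accessor: the thread check runs on every call.
    int get_child_count_guarded() const {
        assert(std::this_thread::get_id() == owner_thread);
        return static_cast<int>(children.size());
    }

    // The fast iteration path discussed above would skip the guard.
    int get_child_count_unguarded() const {
        return static_cast<int>(children.size());
    }
};

int main() {
    Node n, a;
    n.children = { &a };
    return n.get_child_count_guarded() - n.get_child_count_unguarded(); // 0
}
```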
Sorry @Ivorforce, this will need a slight rebase, as I fixed `get_child()` in the meantime.
Thanks!
