Make `AStarGrid2D` class less memory consuming & multithreaded friendly
This will make the points in that class not keep temporary information, which should decrease the memory consumption of this class (from 56 MB to 25 MB on a 1000x1000 grid). Instead, that information will be allocated on demand in the `_solve` function, and released when the function ends. `get_id_path`/`get_point_path`, as well as `_solve` and some others, will receive a `const` tag, so you can now potentially call them from multiple threads using a single `AStarGrid2D` instance.
Also apply changes from https://github.com/godotengine/godot/pull/70485.
If this change is welcomed, I will consider making a similar change to `AStar3D`/`AStar2D`.
Note: However, the average cost (in milliseconds) of calculating a path increases from 0.0005 to ~~0.0025~~ 0.0012 (on a 1000x1000 grid, finding a path from (0, 0) to (999, 999) without jumping). But I think the memory and multithreading benefits outweigh it. The further step I have planned is to implement HPA (hierarchical pathfinding) to significantly decrease the search time on large grids.
Have you got a test MRP that shows the difference? That would be useful for profiling. :+1:
@lawnjelly ah, yes, it's simple - TestAStarGrid.zip
So when compiled at current master, the result is:
With this PR:
Which is worse than the previous result, but `AStarGrid2D` with this patch will support multithreading and consume much less memory.
I'll post here some profiles.
Here is the original code running a slightly modified benchmark above:
Orig code
One thing that jumps out immediately is that `_get_nbors()` is using a linked list, and I noticed this is the same in your PR. Linked lists should generally be avoided in bottleneck code, because of cache coherency and, here, the cost of making allocations/frees when adding/removing from the list (there is no pooling).
One easy win would be to replace it with a `LocalVector`, probably pre-reserved, reusing the same local vector on each run. As you can see from the profiles, the news and frees seem to be taking 30%+ of the time.
I'll now profile the PR...
PR code
Update
Yes, I can confirm: to start with, changing `nbors` to a `LocalVector`, reusing it on each run and just calling `clear()` rather than using a linked list reduces the time for the PR run from 20.82 to 15.37, so well worth doing.
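The reuse pattern described above can be sketched as follows, with `std::vector` standing in for Godot's `LocalVector`; the grid layout and 4-connected neighbour offsets here are illustrative assumptions, not the engine's actual `_get_nbors()` code:

```cpp
#include <cstdint>
#include <vector>

struct Cell {
	int32_t x, y;
};

// Reused across calls: reserve once, and clear() keeps the capacity, so the
// hot path does no heap allocation after warm-up (unlike a linked list,
// which allocates/frees a node per neighbour).
static std::vector<Cell> nbors;

// Collect the 4-connected neighbours of (x, y) inside a w*h grid.
const std::vector<Cell> &get_nbors(int32_t x, int32_t y, int32_t w, int32_t h) {
	if (nbors.capacity() == 0) {
		nbors.reserve(8); // one-off pre-reserve
	}
	nbors.clear(); // keeps previously reserved capacity
	const int32_t dx[4] = { 1, -1, 0, 0 };
	const int32_t dy[4] = { 0, 0, 1, -1 };
	for (int i = 0; i < 4; i++) {
		int32_t nx = x + dx[i], ny = y + dy[i];
		if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
			nbors.push_back({ nx, ny });
		}
	}
	return nbors;
}
```

The point of the sketch is only the allocation pattern: `clear()` on a vector resets the size but not the capacity, so repeated solves touch the allocator zero times.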
The bottleneck is still mostly in memory allocation/freeing, presumably from `RBMap`. In general, the first thing we should be trying is to avoid dynamic allocation in bottlenecks (except when using e.g. pools, which are O(1)).
I'm not absolutely sure `RBMap` is a good data structure to use here yet. It does have the ability to override the allocator, so it could use a pool, but I suspect one of the new robin hood structures reduz did might be better. For smaller grids you could literally use an array to look up whether a cell has been used etc., but that doesn't scale, so I see the desire for some kind of map.
RBMap custom allocator
Using a `PagedAllocator` for RBSet brings the time down from 15.37 to 12.26.
Example:
```cpp
class AStarAllocator {
	struct Dummy {
		uint8_t data[88]; // You would want to fix this in terms of the sizeofs ideally
	};

	static PagedAllocator<Dummy> paged_allocator;

public:
	_FORCE_INLINE_ static void *alloc(size_t p_memory) { return paged_allocator.alloc(); }
	_FORCE_INLINE_ static void free(void *p_ptr) { paged_allocator.free((Dummy *)p_ptr); }
};

PagedAllocator<AStarAllocator::Dummy> AStarAllocator::paged_allocator;

...

RBMap<Vector2i, Pass, Comparator<Vector2i>, AStarAllocator> data;
```
The only thing here is that the `PagedAllocator`, being static, would be using memory when not in use, but I think there is a `reset()` function you could call to clear memory use after `_solve()`.
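For readers outside the engine codebase, the O(1) pooling idea behind `PagedAllocator` can be illustrated with a minimal fixed-size free-list pool. This is a simplified stand-in under stated assumptions, not Godot's actual `PagedAllocator` (which grows in pages and supports `reset()`):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal fixed-size free-list pool: alloc() and free() are O(1) pointer
// swaps, with no per-node heap traffic once the backing storage exists.
template <size_t BLOCK_SIZE, size_t COUNT>
class SimplePool {
	struct Block {
		union {
			Block *next;                    // used while the block is free
			unsigned char data[BLOCK_SIZE]; // used while allocated
		};
	};
	std::vector<Block> storage;
	Block *free_head = nullptr;

public:
	SimplePool() :
			storage(COUNT) {
		// Thread all blocks onto the free list.
		for (size_t i = 0; i + 1 < COUNT; i++) {
			storage[i].next = &storage[i + 1];
		}
		storage[COUNT - 1].next = nullptr;
		free_head = &storage[0];
	}
	void *alloc() {
		assert(free_head && "pool exhausted");
		Block *b = free_head;
		free_head = b->next;
		return b->data;
	}
	void free(void *p) {
		Block *b = reinterpret_cast<Block *>(p);
		b->next = free_head;
		free_head = b;
	}
};
```

The 88-byte block size in the comment above corresponds to the `Dummy` size in the `AStarAllocator` example; a pool like this is why swapping the default allocator removes the malloc/free cost from the profile.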
More
After trying a bitfield to store whether points have been visited, I see that the entire grid of points seems to be stored in the class, so I'm not sure why we need to use a map at all. The `AStarGrid` seems to be storing a `LocalVector` of all the points on the grid (well, actually a local vector of local vectors, I don't know what that's all about), so it would seem we can use direct lookup instead of a map.
Once the other stuff is taken care of, the `RBMap` is the most expensive thing, and if it can be removed and replaced with a direct lookup, that is better. All we need then is a sorted list of points on the open list.
> Yes, I can confirm to start with, changing nbors to a LocalVector, reusing on each run and just calling clear() rather than using a linked list reduces the time for the PR run from 20.82 to 15.37, so well worth doing.
Thanks! Changed.
> So I'm not sure why we need to use a map at all?
Because otherwise, multithreaded will not be possible?
> So I'm not sure why we need to use a map at all?
>
> Because otherwise, multithreaded will not be possible?
I have a sneaking suspicion that the `solid` and `weight_scale` can be stored once and reused for each thread; then for each solve you could use unique `visited` and `closed` bitfields (for direct lookup) and just a sorted open list, with no map. I'll try this out and see if there's any joy, for comparison.
EDIT: Ah, I get it: it's the lookups of previously-visited neighbour nodes that have necessitated the map. I'll try to work out if there's a cheaper way around this.
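The per-solve `visited`/`closed` bitfields mentioned above can be sketched with a minimal 2D bitfield (an illustrative stand-in, not the actual class extended from 3.x):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Packs one flag per grid cell into 64-bit words: a 1000x1000 grid needs
// ~122 KiB instead of ~1 MiB of bools, so zero-clearing it per solve is
// cheap and lookups stay cache-friendly.
class BitField2D {
	int32_t width = 0;
	std::vector<uint64_t> bits;

public:
	void create(int32_t w, int32_t h) {
		width = w;
		bits.assign((size_t(w) * h + 63) / 64, 0);
	}
	void clear() {
		std::fill(bits.begin(), bits.end(), 0);
	}
	void set_bit(int32_t x, int32_t y) {
		size_t i = size_t(y) * width + x;
		bits[i >> 6] |= uint64_t(1) << (i & 63);
	}
	bool get_bit(int32_t x, int32_t y) const {
		size_t i = size_t(y) * width + x;
		return (bits[i >> 6] >> (i & 63)) & 1;
	}
};
```

Each thread would own its own pair of these for a solve, while reading the shared, read-only `solid`/`weight_scale` data, which is what makes the lock-free multithreaded use possible.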
Out of interest, I did a little experimentation with this for fun. I managed to get it down to less than 5.0 using the benchmark (although I'm not sure on the timings; something seems to be causing some random effects, and I'm too tired to investigate), and it uses around 20 megs less memory at runtime. I'm not intending to take over this, and my branch is not all that tidy and doesn't contain the `jump` code, but for reference here is my branch:
https://github.com/lawnjelly/godot/tree/astar2d_optimize
Key things:
- Given that the class is going down the route of storing things for every grid square, it is possible to avoid using the slow `RBMap` by just having a lookup grid (1000x1000 in this case) of IDs for passes on each grid square. This uses about 500 KB, and doesn't need to be zero-initialized. Of course, here we are potentially trading off cache misses for a slow map structure, so profiling is key.
- The state of whether a square has been visited or not, and whether it is on the closed list, can be stored compactly using a 2D bitfield. I tend to use these extensively, and extended the class I already use in 3.x. The two bitfields do have to be initialized to zero, and are each around 122 KB. The cost of clearing temporary data is significant, so the use of bitfields ensures good packing.
- The most expensive thing now is maintaining the sorted open list queue. I'm using the same code, I think, as was already used, as the heap sort is a lot more efficient than naive solutions.
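The heap-based open list can be sketched with `std::priority_queue` as a stand-in for the engine's heap code; the entry fields here are illustrative assumptions:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// An A* open-list entry: cell index plus its f = g + h priority.
struct OpenEntry {
	float f_score;
	int32_t index;
};

// Min-heap on f_score: push and pop are O(log n), far cheaper than keeping
// a flat list fully sorted by re-sorting or insertion on every update.
struct OpenEntryGreater {
	bool operator()(const OpenEntry &a, const OpenEntry &b) const {
		return a.f_score > b.f_score;
	}
};

using OpenList = std::priority_queue<OpenEntry, std::vector<OpenEntry>, OpenEntryGreater>;
```

Popping always yields the cheapest frontier cell, which is exactly the operation the A* main loop performs once per expansion.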
Interestingly, aside from storing the solid/non-solid status and weights for grid squares, there's nothing else that really needs memory scaling with the grid size. You could in theory have super large grids and not run into squared memory issues. However, I expect it would be used with a tilemap or something, so that would use up memory with large grids anyway.
As with navigation pathfinding, I would always suggest for A star etc to be able to do async pathfinding, as usually there's no problem with spreading the computation over a few frames, whereas trying to cram everything in a single frame can often lead to stalls.
I think for large grids, simplifying the search space could be worth investigating. This is something the `jump` code seems to try to make more efficient. One thing that occurred to me is that you could quite easily build a series of rects covering contiguous non-solid regions. This means you could pathfind from any edge to another without checking intermediate tiles. Another option is keeping a distance field of the distance to the nearest solid for each grid square. The ability to weight grid squares complicates both of these, though. Hierarchical pathfinding is also often used.
Now looking at the original code and the use of `RBMap` in this PR, it looks like the original version is closer in spirit to what I ended up with.
So unless the intention is to reduce data storage for very large grids (and prevent data usage scaling with grid size), I would personally steer away from using a map, and separate the original code into data common to all threads (solid and weights) and data only needed for the `solve()` (sorted open list, closed status, and a lookup to find already-visited passes).
Also, if a map were used, I would double-check with reduz whether `RBMap` would be the most efficient for this purpose. He did do a robin hood hashmap recently, and started an unordered version (although that may not have been finished).
Okay, that would require a lot more work, and I've lost interest a bit for now. Maybe I'll touch this again later (after 4.0 is released).