godot
godot copied to clipboard
Implement array based hash map `AHashMap`
Currently, HashMap is implemented as a double linked list, which creates a significant problem with CPU cache pollution, as we store not only the necessary data but also pointers and allocator metadata. There are also costs for searching for free memory in the heap.
Implementation view of list and array based hash map.
ListHashMap:
AHashMap:
When to use HashMap instead of AHashMap:
- You need to keep an iterator or const pointer to Key and you intend to add/remove elements in the meantime.
- You need to preserve the insertion order when using erase.(Erase at the position of the deleted element inserts the element from the end)
<-------------
| |
6 8 X 9 32 -1 5 -10 7 X X X
6 8 7 9 32 -1 5 -10 X X X X
What did I do in this pr:
I implemented the structure, added tests, and replaced maps that I thought needed to be replaced in the animation code.
Benchmarks:
I made 3D animation benchmarks (many 3D models play the same animation): Animation_test.zip
ListHashMap(Master): Project FPS: 77 (12.98 mspf) Project FPS: 78 (12.82 mspf) Project FPS: 78 (12.82 mspf) Project FPS: 79 (12.65 mspf)
AHashMap(Current PR): Project FPS: 103 (9.70 mspf) Project FPS: 104 (9.61 mspf) Project FPS: 103 (9.70 mspf) Project FPS: 103 (9.70 mspf) Project FPS: 102 (9.80 mspf)
Here you can see a 27% increase in performance.
List/array HashMaps benchmarks KeyValue<long, long>. Benchmarking shows 2000% faster iteration, 25% faster random search, and 300% faster insertion compared to the current HashMap implementation. AHashMap<long, long> uses 2.5 times less memory than HashMap<long, long>.
Projects in which I tested for regression From demo projects Voxel, bullet shower, platformer2/3d, particles 2/3d, and some of my own projects.
BugSquad edit: Closes #93663
I did a brief look through the implementation, and there is quite a bit of duplicate code between this and the original HashMap implementation.
Additionally, the new implementation uses powers-of-2 in the modulo calculations, while the old implementation uses something called fastmod. Since integer division is slow, and that almost every function uses modulo including in while/for loops, using powers-of-2 for modulo alone can improve performance quite a bit. For all we know, that is where we are seeing the performance improvement, and not from storing the elements in an array.
I think applying the performance improvements (besides storing elements in an array) to HashMap should be done so that the only difference is that AHashMap stores elements in an array. This way we can confidently assert that the performance gain is from removing cache misses and not from something else.
@dhoverml see this PR : #90082
I just made benchmarks for animations for that PR. There is an improvement of 8-9% comparing with master. It should also be taken into account that there is a change in the hasher, which also has a significant improvement effect.
Oh nice, that's great! Sorry if I missed it, but have you compared #90082 with this PR?
Oh nice, that's great! Sorry if I missed it, but have you compared #90082 with this PR?
If this PR adds 23%, and that one adds 8%, then the difference will be 15%. Both they give about 25-26%.
Looks interesting... and what the difference between this and OAHashMap? And where to use this and where OAHashMap? Can it be used in AStar3D class instead of OAHashMap for points to increase its performance?
I took a look at the hashmap implementation and haven't noticed any issue with the code (not a fan of robin hood hashing tho). I do have one suggestion, as @dhoverml pointed out, this AHashMap shares a lot of code with original HashMap, it is possible to unify the two classes by adding a new conditional template parameter, something like template <typename TKey, typename TValue, typename Hasher = HashMapHasherDefault, typename Comparator = HashMapComparatorDefault<TKey>, bool IsArrayedHashMap=true> class HashMap. It should be able to deduplicate the code.
@rossbridger This is not possible because the code is very different. Different iteration algorithm, different attributes, different implemented functions.
@Chaosus
what the difference between this and OAHashMap?
The difference is that the elements in A are stored linearly as in a dynamic array. In OA, the elements are stored in the array at the position where their index matches the index where the hash is stored. In A is stored the index of the element in array next to the hash.
And where to use this and where OAHashMap?
It is better to use OA where the size of the pair is < 16 bytes and the classes of keys and values do not have constructors and destructors (they are trivial).
Can it be used in AStar3D class instead of OAHashMap for points to increase its performance?
Yes, it can.
Testing on a s23 + qualcom adreno 740
Master 28-30 dropping even to 25 fps
This pr : 37-38 or 39 dropping to 36 , prettt good increase 😀, can't test the blend process ome because theres no android editor artifact. Using calinou test project https://github.com/godotengine/godot/files/15501000/Animation_test.zip
@rossbridger This is not possible because the code is very different. Different iteration algorithm, different attributes, different implemented functions.
I think it's possible, you just need to use some C++11 template metaprogramming features like std::conditional to instantiate different data structures and algorithms at compile time as long as public function signatures are the same. That being said, if you don't think it would improve maintainability, it's fine having a separate class for .
Also, do you think we can replace the HashMap with AHashMap in core/object/worker_thread_pool.h and drivers/d3d12/rendering_device_driver_d3d12.h so that we can remove Allocator template parameter from HashMap? These are the only places with custom Allocator and I don't see the use of HashMap there requires preserving insertion order.
Also, do you think we can replace the
HashMapwithAHashMapin core/object/worker_thread_pool.h and drivers/d3d12/rendering_device_driver_d3d12.h
I don't think it is necessary. Since erase is often used in ThreadPool (didn't check, but maybe it is faster in normal map), and in d3d12 BarrierRequest weighs a lot.
So, at this point I think it would be better to use this map only for animations and not for the whole engine to avoid regression and make the review process easier. And the use for all other cases should be done in other PRs after merging this PR.
@TokageItLab for tracking animation performance improvements
I don't think it is necessary. Since
eraseis often used in ThreadPool (didn't check, but maybe it is faster in normal map), and in d3d12BarrierRequestweighs a lot.
So there's also an element size threshold. Could you maybe add that to When to use HashMap instead of AHashMap in the PR description? I have the feeling that many will check that section from now on when in doubt. I'm among them. The thing is, while it seems fine to keep this PR scoped to animation, contributors to other areas will like to assess this new map for future features or upgrading current ones.
@RandomShaper There is no limit to the size of an element. Just in my opinion, reallocating a large array can take more time than just creating a new element.
In the description, I wrote mandatory items. But I will try to add recommended ones.
Any reasonable guideline like that, even if subject to specifics of each case, will help. (In any case, I still believe that every map—and every container class, for that matter—should have a bulleted list about its traits: iterator invalidation, sample usage situations—generic—, etc., but that's aside.)
Made a description of the map in the header file.
Off topic , but i feel some curiosity why before it said 57 before and now in that same case says 76, without mentioning the after result because with them looks almost double the original performance of before( 99fps).. Is there any that improved that?
@Saul2022 This #97744 and this #92838 improved master performance, so I updated it. And #92838 actually improved performance a bit better than the PR description says.
Also, this week I've been working on using the array methods (get_index, get_by_index) in the animation code to avoid using map lookups in bottlenecks.
Thanks!