Implement StaticBlockAllocator for HashMap, List, RBMap, RBSet.
About this PR
This PR improves performance and reduces RAM usage across the entire engine wherever HashMap, List, RBMap, and RBSet are used. I implemented a new StaticBlockAllocator for these structures, along with a BlockAllocator that StaticBlockAllocator uses internally.
How are elements in these structures allocated?
Structures like HashMap, List, RBMap, and RBSet currently allocate and free each element through malloc, wrapped in the Memory class. As many know, malloc is a general-purpose allocator: for every element it stores extra metadata and reserves a little more space than requested in case the block is later grown with realloc. So if we ask for 16 bytes, malloc effectively takes 32, with 8 bytes spent recording the allocation size and 8 bytes kept in reserve for realloc. That is wasted here, since we are never going to grow an individual List element.
(Memory here refers to the Memory class from core.)
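As a rough illustration of that overhead (glibc-specific and independent of this PR), malloc_usable_size shows how a small request is rounded up even before the hidden chunk header is counted:

#include <cstdio>
#include <cstdlib>
#include <malloc.h> // glibc-specific header providing malloc_usable_size()

int main() {
    void *p = malloc(16); // roughly the payload of one small container element
    // On glibc this typically prints 24: the request is rounded up to the next
    // chunk size, and the allocator's chunk header adds another 8-16 bytes on top.
    printf("requested 16 bytes, usable %zu bytes\n", malloc_usable_size(p));
    free(p);
    return 0;
}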
Custom allocators
The solution to this is to use custom allocators. Currently, Godot uses PagedAllocator for RID allocation and for some HashMaps. However, it is a poor fit as a default allocator: it has to be configured for every new instance, and the allocator object itself is fairly large, so it is unsuitable for scene use. Nodes would weigh roughly twice as much, and adding the first element to a Node would take noticeably longer.
So the best default turned out to be a static allocator that keeps its own allocator for each element size. PagedAllocator proved unsuitable for this purpose, so I used my BlockAllocator instead.
What is BlockAllocator?
BlockAllocator is a hybrid of PagedAllocator and StackAllocator. I created it about half a year ago as my first allocator. When an element is deleted, it is pushed onto a free list, and that list is stored in place of the deleted elements themselves, which saves a lot of memory. As long as the free list is not empty, the allocator's bump pointer does not move forward. Each new page doubles in size. In practice it turned out to be even faster than PagedAllocator.
How alloc and free work.
How blocks of blocks (pages) are interconnected.
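To make the free-list idea concrete, here is a minimal sketch (illustrative only, not the PR's actual BlockAllocator, which additionally doubles the page size with each new page and links the pages together):

// Minimal single-page sketch of the in-place free list; alignment handling
// and page growth are omitted on purpose.
class TinyBlockAllocator {
    struct FreeSlot {
        FreeSlot *next; // stored inside the freed element itself
    };

    uint8_t *page = nullptr;
    size_t slot_size = 0;
    size_t capacity = 0; // in slots
    size_t used = 0; // bump pointer, in slots
    FreeSlot *free_list = nullptr;

public:
    TinyBlockAllocator(size_t p_slot_size, size_t p_capacity) :
            slot_size(MAX(p_slot_size, sizeof(FreeSlot))), capacity(p_capacity) {
        page = (uint8_t *)memalloc(slot_size * capacity);
    }

    void *alloc() {
        if (free_list) { // reuse a freed slot first
            void *slot = free_list;
            free_list = free_list->next;
            return slot;
        }
        // The bump pointer only advances while the free list is empty.
        ERR_FAIL_COND_V(used == capacity, nullptr); // the real allocator adds a new, larger page here
        return page + (used++) * slot_size;
    }

    void free(void *p_mem) {
        FreeSlot *slot = (FreeSlot *)p_mem; // overwrite the dead element with the list node
        slot->next = free_list;
        free_list = slot;
    }

    ~TinyBlockAllocator() {
        memfree(page);
    }
};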
What is StaticBlockAllocator?
StaticBlockAllocator is a static allocator. For a given element size it hands out an id, which is then used to allocate and free data. When an id is requested, the allocator checks whether it already has an allocator of the required size; if not, it creates one. Each thread has its own array of BlockAllocators with unique sizes, so if two threads both allocate 32-byte elements, two separate allocators of that size are created.
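Roughly, the id lookup works like this (a hedged sketch with illustrative names and an assumed size-taking BlockAllocator constructor, not the PR's exact API):

// One registry per thread: threads never share an allocator (or its lock)
// for a given element size.
struct PerThreadRegistry {
    LocalVector<size_t> sizes; // element size registered at index i
    LocalVector<BlockAllocator *> allocators; // allocator serving that size
};

static thread_local PerThreadRegistry registry;

uint32_t static_alloc_id(size_t p_size) {
    for (uint32_t i = 0; i < registry.sizes.size(); i++) {
        if (registry.sizes[i] == p_size) {
            return i; // this thread already has an allocator for p_size
        }
    }
    registry.sizes.push_back(p_size);
    registry.allocators.push_back(memnew(BlockAllocator(p_size))); // assumed constructor
    return registry.sizes.size() - 1;
}

void *static_alloc(uint32_t p_id) {
    return registry.allocators[p_id]->alloc();
}

void static_free(uint32_t p_id, void *p_mem) {
    registry.allocators[p_id]->free(p_mem);
}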
What is TypedStaticBlockAllocator?
This is the final templated allocator used by the container structures. It weighs 8 bytes and stores the id provided by StaticBlockAllocator; through that id it allocates and frees data, and it also calls the constructor and destructor of the stored type.
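Conceptually it boils down to something like this (again an illustrative sketch built on the registry functions above, not the PR's exact code):

// The only stored member is the id (the real class reports 8 bytes per instance);
// construction and destruction of T are layered on top of the raw allocator.
template <typename T>
class TypedStaticSketch {
    uint32_t id = static_alloc_id(sizeof(T)); // resolved once per container instance

public:
    template <typename... Args>
    T *alloc(Args &&...p_args) {
        T *mem = (T *)static_alloc(id);
        memnew_placement(mem, T(std::forward<Args>(p_args)...)); // call the constructor
        return mem;
    }

    void free(T *p_mem) {
        p_mem->~T(); // call the destructor
        static_free(id, p_mem);
    }
};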
Performance Tests
Let's start by testing List<int>.
Master branch (malloc implementation), List<int>, 10 000 000 elements:
- Time insert: 236674 usec
- Time iteration: 39287 usec
- Time clear: 141672 usec
- Time total: 417736 usec
TypedStaticBlockAllocator:
- Time insert: 141104 usec
- Time iteration: 25416 usec
- Time clear: 76181 usec
- Time total: 242765 usec
Direct usage of BlockAllocator:
- Time insert: 96636 usec
- Time iteration: 26894 usec
- Time clear: 39474 usec
- Time total: 163112 usec
Direct usage of PagedAllocator:
- Time insert: 106321 usec
- Time iteration: 31373 usec
- Time clear: 59099 usec
- Time total: 196889 usec
Memory Usage List<int> for 100 000 000 elements:
- BlockAllocator: 2.95 GB
- PagedAllocator: 3.62 GB
- Malloc: 4.63 GB
So, as you can see, my allocator performs even better than PagedAllocator (although I expected it to be slower). The improvement in iteration time is worth noting in particular: it means working with the allocated data is much faster, since the layout is more cache friendly.
Code used:
#include <chrono>

int var = 0; // global sink so the iteration loop is not optimized away

template <typename T>
void test_list() {
    T list_standard;
    auto start = std::chrono::system_clock::now();
    {
        auto clock_start = std::chrono::system_clock::now();
        for (int i = 0; i < 10000000; i++) {
            list_standard.push_back(i);
        }
        auto clock_now = std::chrono::system_clock::now();
        // Durations are reported in microseconds (usec), matching the numbers above.
        auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        print_line("Time insert:", current_time);
    }
    {
        auto clock_start = std::chrono::system_clock::now();
        for (const int num : list_standard) {
            var = num;
        }
        auto clock_now = std::chrono::system_clock::now();
        auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        print_line("Time iteration:", current_time);
    }
    {
        auto clock_start = std::chrono::system_clock::now();
        list_standard.clear();
        auto clock_now = std::chrono::system_clock::now();
        auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        print_line("Time clear:", current_time);
    }
    auto end = std::chrono::system_clock::now();
    auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    print_line("Time total:", current_time);
}

void main_benchmark() {
    test_list<List<int>>();
}
Dictionary test.
Dictionary[int, String]
Master branch:
- Time insert: 2655 msec
- Time iteration: 325 msec
- Time clear: 801 msec
- Time total: 5116 msec
Current PR:
- Time insert: 2472 msec
- Time iteration: 288 msec
- Time clear: 500 msec
- Time total: 3260 msec
Code Used:
func _ready() -> void:
    var start_test: int = Time.get_ticks_msec()
    var dict: Dictionary[int, String]
    if true:
        var start: int = Time.get_ticks_msec()
        for i in 10000000:
            dict[i] = " "
        var end: int = Time.get_ticks_msec()
        print("Time insert:", end - start)
    if true:
        var start: int = Time.get_ticks_msec()
        for n in dict.keys():
            pass
        var end: int = Time.get_ticks_msec()
        print("Time iteration:", end - start)
    if true:
        var start: int = Time.get_ticks_msec()
        dict.clear()
        var end: int = Time.get_ticks_msec()
        print("Time clear:", end - start)
    var end_test: int = Time.get_ticks_msec()
    print("Time total:", end_test - start_test)
Memory Usage Dictionary[int, String] for 100 000 000 elements:
- Master: 11.37 GB
- Current PR: 8.4 GB
Code used:
var dict: Dictionary[int, String]
for i in 100000000:
    dict[i] = " "
while true:
    pass # keep the process alive so memory usage can be read
Speed of adding children to a node.
Add 1000000 Nodes as child:
- Master: 3225423 usec
- Current PR: 2825762 usec
func _ready() -> void:
    var start: int = Time.get_ticks_usec()
    for i in 1000000:
        add_child(Node.new())
    var end: int = Time.get_ticks_usec()
    print(end - start)
Improved performance in some MRPs
MRP from #85871. Improves performance and reduces memory usage in the editor.
- Master: 3.73 GB
- Current PR: 2.97 GB
Memory usage can be checked via htop or task manager.
Until #92554 is merged, this PR improves animation performance by 5%.
Also, the MRP from #93568 loads 5% faster, and in general projects load 5-10% faster.
Improving mesh generation
I used this demo: https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel. The chart shows the average mesh generation time for a chunk before and after.
Better multithreading performance
See this comment: https://github.com/godotengine/godot/pull/97016#issuecomment-2356447368
Godot Benchmarks
- Master: master.json
- Current PR: multithread_allocator.json
Possible future PRs with the new BlockAllocator
- Using the new BlockAllocator in RID_Alloc. The plus side of this is that it is non-templated, meaning it won't generate new code for every new class that uses it, helping to reduce the size of the engine by a few megabytes as well as reducing CPU cache usage.
- Using it in ClassDB to make it lock-free and multi-thread safe.
I did the godot-benchmark test. Compile command:
scons CC=gcc-12 CXX=g++-12 scu_build=yes scu_limit=1024 lto=full linker=mold
~~master.json~~ ~~StaticBlockAllocator.json~~ Edit: these are old benchmarks; the new benchmarks are in the PR description.
Something I noticed is that RSA key generation has become 5 times slower according to these benchmarks. Might be worth looking at?
It looks like the RSA benchmark is not deterministic.
Master:
Current PR:
Anyway, I don't think these structures are used there or that they can affect its performance.
I did find a regression created by this PR. There was a significant degradation in multithreading because data from different threads using the same structure type ended up next to each other in memory, so the threads were effectively working only through the L3 cache.
Now I have not only fixed it but improved on it: separate allocators are used for each thread, so each thread has its own mutex, which greatly improved multithreaded performance.
So it can be tested with this code:
func _some_job(n):
    var dict: Dictionary
    var start: int = Time.get_ticks_usec()
    for i in 1000000:
        dict[i] = i
    var end: int = Time.get_ticks_usec()
    for res in dict:
        pass
    print("Time insert:", end - start)

func _ready() -> void:
    var start: int = Time.get_ticks_usec()
    var task_id := WorkerThreadPool.add_group_task(_some_job, 8, -1, true)
    WorkerThreadPool.wait_for_group_task_completion(task_id)
    var end: int = Time.get_ticks_usec()
    print("Full Time insert:", end - start)
Results Master:
Time insert:509230
Time insert:592604
Time insert:559632
Time insert:577218
Time insert:600717
Time insert:661088
Time insert:591868
Time insert:634930
Full Time insert:1361299
Before fixing:
Time insert:2104979
Time insert:2193414
Time insert:2261067
Time insert:2261572
Time insert:2256125
Time insert:2301050
Time insert:2307332
Time insert:2352035
Full Time insert:5498995
Now:
Time insert:335565
Time insert:338475
Time insert:374039
Time insert:483497
Time insert:484436
Time insert:485592
Time insert:493170
Time insert:490287
Full Time insert:790270
Edit: Actually the multithreading bottleneck is the atomic variable in the Memory class, not malloc: https://github.com/godotengine/godot/blob/a40fc2354a1639f283c7c46ba8dc321b7ff15960/core/os/memory.cpp#L112
nice
This PR is very interesting! I think there are 2 core ideas that are worth exploring:
- Using shared (Static) fixed-size allocators for HashMap, to reduce allocation overhead.
- Using freed memory in pools to store pointers to the next free allocation, to save on RAM use.
Note: I haven't tested this PR directly yet, since I wanted to get a feel for how it works and what it proposes first. The following are my findings from exploring the ideas in isolation.
Using shared fixed size allocators
To explore this idea in isolation, I implemented StaticPagedAllocator based on PagedAllocator (to avoid de-references):
(based on a quick PagedAllocator size-specific rewrite)
template <size_t N, bool thread_safe = false, uint32_t DEFAULT_PAGE_SIZE = 4096>
class PagedMAllocator {
    struct Slot {
        char data[N];
    };

    Slot **page_pool = nullptr;
    Slot ***available_pool = nullptr;
    uint32_t pages_allocated = 0;
    uint32_t allocs_available = 0;
    uint32_t page_shift = 0;
    uint32_t page_mask = 0;
    uint32_t page_size = 0;
    SpinLock spin_lock;

public:
    void *alloc() {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        if (unlikely(allocs_available == 0)) {
            uint32_t pages_used = pages_allocated;
            pages_allocated++;
            page_pool = (Slot **)memrealloc(page_pool, sizeof(Slot *) * pages_allocated);
            available_pool = (Slot ***)memrealloc(available_pool, sizeof(Slot **) * pages_allocated);
            page_pool[pages_used] = (Slot *)memalloc(sizeof(Slot) * page_size);
            available_pool[pages_used] = (Slot **)memalloc(sizeof(Slot *) * page_size);
            for (uint32_t i = 0; i < page_size; i++) {
                available_pool[0][i] = &page_pool[pages_used][i];
            }
            allocs_available += page_size;
        }
        allocs_available--;
        void *alloc = available_pool[allocs_available >> page_shift][allocs_available & page_mask];
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
        return alloc;
    }

    void free(void *p_mem) {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        available_pool[allocs_available >> page_shift][allocs_available & page_mask] = (Slot *)p_mem;
        allocs_available++;
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }

private:
    void _reset(bool p_allow_unfreed) {
        if (!p_allow_unfreed) {
            ERR_FAIL_COND(allocs_available < pages_allocated * page_size);
        }
        if (pages_allocated) {
            for (uint32_t i = 0; i < pages_allocated; i++) {
                memfree(page_pool[i]);
                memfree(available_pool[i]);
            }
            memfree(page_pool);
            memfree(available_pool);
            page_pool = nullptr;
            available_pool = nullptr;
            pages_allocated = 0;
            allocs_available = 0;
        }
    }

public:
    void reset(bool p_allow_unfreed = false) {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        _reset(p_allow_unfreed);
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }

    bool is_configured() const {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        bool result = page_size > 0;
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
        return result;
    }

    void configure(uint32_t p_page_size) {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        ERR_FAIL_COND(page_pool != nullptr); // Safety check.
        ERR_FAIL_COND(p_page_size == 0);
        page_size = nearest_power_of_2_templated(p_page_size);
        page_mask = page_size - 1;
        page_shift = get_shift_from_power_of_2(page_size);
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }

    // Power of 2 recommended because of alignment with OS page sizes.
    // Even if element is bigger, it's still a multiple and gets rounded to amount of pages.
    PagedMAllocator(uint32_t p_page_size = DEFAULT_PAGE_SIZE) {
        configure(p_page_size);
    }

    ~PagedMAllocator() {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        bool leaked = allocs_available < pages_allocated * page_size;
        if (leaked) {
            if (CoreGlobals::leak_reporting_enabled) {
                ERR_PRINT(String("Pages in use exist at exit in PagedMAllocator: ") + String::num_int64(N));
            }
        } else {
            _reset(false);
        }
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }
};
template <typename T, bool thread_safe = false, uint32_t DEFAULT_PAGE_SIZE = 4096>
class PagedAllocator {
    PagedMAllocator<sizeof(T), thread_safe, DEFAULT_PAGE_SIZE> mallocator;

public:
    template <typename... Args>
    _FORCE_INLINE_ T *alloc(Args &&...p_args) {
        T *alloc = (T *)mallocator.alloc();
        memnew_placement(alloc, T(std::forward<Args>(p_args)...));
        return alloc;
    }

    _FORCE_INLINE_ void free(T *p_mem) {
        p_mem->~T();
        mallocator.free(p_mem);
    }

    _FORCE_INLINE_ bool is_configured() const { return mallocator.is_configured(); }
    _FORCE_INLINE_ void configure(uint32_t p_page_size) { mallocator.configure(p_page_size); }

    template <typename... Args>
    _FORCE_INLINE_ T *new_allocation(Args &&...p_args) { return alloc(std::forward<Args>(p_args)...); }
    _FORCE_INLINE_ void delete_allocation(T *p_mem) { free(p_mem); }

    _FORCE_INLINE_ void reset(bool p_allow_unfreed = false) {
        mallocator.reset(p_allow_unfreed || std::is_trivially_destructible_v<T>);
    }

    PagedAllocator(uint32_t p_page_size = DEFAULT_PAGE_SIZE) {
        configure(p_page_size);
    }

    _FORCE_INLINE_ ~PagedAllocator() {
        reset(false);
    }
};
#include "core/templates/paged_allocator.h"
template <size_t N, bool thread_safe>
class StaticPagedMAllocator {
static inline PagedMAllocator<N, thread_safe> mallocator;
public:
_FORCE_INLINE_ void *alloc() { return mallocator.alloc(); }
_FORCE_INLINE_ void free(void *p_mem) { mallocator.free(p_mem); }
};
template <typename T, bool thread_safe>
class StaticPagedAllocator {
public:
template <typename... Args>
_FORCE_INLINE_ T *alloc(Args &&...p_args) {
static StaticPagedMAllocator<sizeof(T), thread_safe> mallocator;
T *alloc = (T *)mallocator.alloc();
memnew_placement(alloc, T(std::forward<Args>(p_args)...));
return alloc;
}
_FORCE_INLINE_ void free(T *p_mem) {
static StaticPagedMAllocator<sizeof(T), thread_safe> mallocator;
p_mem->~T();
mallocator.free(p_mem);
}
template <typename... Args>
_FORCE_INLINE_ T *new_allocation(Args &&...p_args) { return alloc(std::forward<Args>(p_args)...); }
_FORCE_INLINE_ void delete_allocation(T *p_mem) { free(p_mem); }
};
I used this allocator in HashMap by default.
Unfortunately, while a very interesting idea, this made no practical performance difference (in either RAM or speed). The reason may be that macOS uses jemalloc as its allocator, which is already very good. But it may be worth exploring this idea again in the future, at least for specific use cases.
Use freed memory in pools to store pointers in
I explored this idea by patching PagedAllocator to use a free chain threaded through the freed values instead of an available_pool.
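For reference, a minimal sketch of that free-chain approach (illustrative only, not the exact patch measured below):

// Instead of a parallel available_pool of pointers, every freed slot stores the
// pointer to the next free slot, so the pool needs no extra bookkeeping memory.
template <size_t N>
class FreeChainPoolSketch {
    union Slot {
        Slot *next_free; // valid only while the slot is free
        char data[N];
    };

    LocalVector<Slot *> pages;
    uint32_t page_size = 4096;
    uint32_t top = 0; // next never-used slot in the last page
    Slot *free_chain = nullptr;

public:
    void *alloc() {
        if (free_chain) { // O(1) pop from the chain of freed slots
            Slot *slot = free_chain;
            free_chain = slot->next_free;
            return slot;
        }
        if (pages.is_empty() || top == page_size) {
            pages.push_back((Slot *)memalloc(sizeof(Slot) * page_size));
            top = 0;
        }
        return &pages[pages.size() - 1][top++];
    }

    void free(void *p_mem) { // O(1) push onto the chain, reusing the slot's own memory
        Slot *slot = (Slot *)p_mem;
        slot->next_free = free_chain;
        free_chain = slot;
    }

    ~FreeChainPoolSketch() {
        for (uint32_t i = 0; i < pages.size(); i++) {
            memfree(pages[i]);
        }
    }
};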
This decreased global RAM use of Godot by 10 MB (yay), but measurably regressed deallocation speed (3x slower). I've got to be honest, I don't understand why freeing regressed that much. Allocation also regressed, but only by ~5%.
All in all this idea is still very interesting to me, but again, jemalloc is very good already, and trying to one-up it can be difficult. I'm open to re-exploring this idea in the future though!
Using both at once
I tried using both patches at once, to see if RAM can be improved even more. There was a slight regression of RAM use over master, so I did not explore this idea further (unfortunate though it is!).
Takeaways
Both ideas are interesting to me, but I was unable to implement them in a way that measurably improves over master.
To reiterate, I haven't tested your code yet, because I wanted to understand the ideas being proposed here first. I feel confident now that I do, so I'll probably have a go at measuring your implementation soon.
If you're still open to maintaining this PR, I can definitely see it being useful to split it into two:
- Adding BlockAllocator as an alternative or replacement for PagedAllocator, which it is in direct competition with. Its performance could be tested in StringName, where it may have an interesting impact.
- Adding static / shared allocators (either PagedAllocator-based or BlockAllocator-based), and using them in List, HashMap and the likes.
Needs rebase