Implement StaticBlockAllocator for HashMap, List, RBMap, RBSet.
About this PR
This PR improves performance and reduces RAM usage across the entire engine wherever HashMap, List, RBMap, and RBSet are used. I implemented a new StaticBlockAllocator for these structures, along with a BlockAllocator that StaticBlockAllocator uses internally.
How are elements in these structures allocated?
Structures like HashMap, List, RBMap, and RBSet currently allocate and free each element through malloc, wrapped in the Memory class. As many know, malloc is a general-purpose allocator: for every element it stores extra metadata and reserves a little more space than requested in case the block is later grown with realloc. So if we ask for 16 bytes, malloc effectively takes 32, with 8 bytes spent recording the allocation size and 8 bytes kept in reserve for realloc. That is wasted here, since we are never going to grow an individual List element.
(Memory here refers to the Memory class from core.)
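As a rough illustration of that overhead (glibc-specific and independent of this PR), malloc_usable_size shows how a small request is rounded up even before the hidden chunk header is counted:

#include <cstdio>
#include <cstdlib>
#include <malloc.h> // glibc-specific header providing malloc_usable_size()

int main() {
    void *p = malloc(16); // roughly the payload of one small container element
    // On glibc this typically prints 24: the request is rounded up to the next
    // chunk size, and the allocator's chunk header adds another 8-16 bytes on top.
    printf("requested 16 bytes, usable %zu bytes\n", malloc_usable_size(p));
    free(p);
    return 0;
}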
Custom allocators
The solution to this is to use custom allocators. Currently, Godot uses PagedAllocator for RID allocation and for some HashMaps. However, it is a poor fit as a default allocator: it has to be configured for every new instance, and the allocator object itself is fairly large, so it is unsuitable for scene use. Nodes would weigh roughly twice as much, and adding the first element to a Node would take noticeably longer.
So the best default turned out to be a static allocator that keeps its own allocator for each element size. PagedAllocator proved unsuitable for this purpose, so I used my BlockAllocator instead.
What is BlockAllocator?
BlockAllocator is a hybrid of PagedAllocator and StackAllocator. I created it about half a year ago as my first allocator. When an element is deleted, it is pushed onto a free list, and that list is stored in place of the deleted elements themselves, which saves a lot of memory. As long as the free list is not empty, the allocator's bump pointer does not move forward. Each new page doubles in size. In practice it turned out to be even faster than PagedAllocator.
How alloc and free work.
How blocks of blocks (pages) are interconnected.
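To make the free-list idea concrete, here is a minimal sketch (illustrative only, not the PR's actual BlockAllocator, which additionally doubles the page size with each new page and links the pages together):

// Minimal single-page sketch of the in-place free list; alignment handling
// and page growth are omitted on purpose.
class TinyBlockAllocator {
    struct FreeSlot {
        FreeSlot *next; // stored inside the freed element itself
    };

    uint8_t *page = nullptr;
    size_t slot_size = 0;
    size_t capacity = 0; // in slots
    size_t used = 0; // bump pointer, in slots
    FreeSlot *free_list = nullptr;

public:
    TinyBlockAllocator(size_t p_slot_size, size_t p_capacity) :
            slot_size(MAX(p_slot_size, sizeof(FreeSlot))), capacity(p_capacity) {
        page = (uint8_t *)memalloc(slot_size * capacity);
    }

    void *alloc() {
        if (free_list) { // reuse a freed slot first
            void *slot = free_list;
            free_list = free_list->next;
            return slot;
        }
        // The bump pointer only advances while the free list is empty.
        ERR_FAIL_COND_V(used == capacity, nullptr); // the real allocator adds a new, larger page here
        return page + (used++) * slot_size;
    }

    void free(void *p_mem) {
        FreeSlot *slot = (FreeSlot *)p_mem; // overwrite the dead element with the list node
        slot->next = free_list;
        free_list = slot;
    }

    ~TinyBlockAllocator() {
        memfree(page);
    }
};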
What is StaticBlockAllocator?
StaticBlockAllocator is a static allocator. For a given element size it hands out an id, which is then used to allocate and free data. When an id is requested, the allocator checks whether it already has an allocator of the required size; if not, it creates one. Each thread has its own array of BlockAllocators with unique sizes, so if two threads both allocate 32-byte elements, two separate allocators of that size are created.
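Roughly, the id lookup works like this (a hedged sketch with illustrative names and an assumed size-taking BlockAllocator constructor, not the PR's exact API):

// One registry per thread: threads never share an allocator (or its lock)
// for a given element size.
struct PerThreadRegistry {
    LocalVector<size_t> sizes; // element size registered at index i
    LocalVector<BlockAllocator *> allocators; // allocator serving that size
};

static thread_local PerThreadRegistry registry;

uint32_t static_alloc_id(size_t p_size) {
    for (uint32_t i = 0; i < registry.sizes.size(); i++) {
        if (registry.sizes[i] == p_size) {
            return i; // this thread already has an allocator for p_size
        }
    }
    registry.sizes.push_back(p_size);
    registry.allocators.push_back(memnew(BlockAllocator(p_size))); // assumed constructor
    return registry.sizes.size() - 1;
}

void *static_alloc(uint32_t p_id) {
    return registry.allocators[p_id]->alloc();
}

void static_free(uint32_t p_id, void *p_mem) {
    registry.allocators[p_id]->free(p_mem);
}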
What is TypedStaticBlockAllocator?
This is the final templated allocator used by the container structures. It weighs 8 bytes and stores the id provided by StaticBlockAllocator; through that id it allocates and frees data, and it also calls the constructor and destructor of the stored type.
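Conceptually it boils down to something like this (again an illustrative sketch built on the registry functions above, not the PR's exact code):

// The only stored member is the id (the real class reports 8 bytes per instance);
// construction and destruction of T are layered on top of the raw allocator.
template <typename T>
class TypedStaticSketch {
    uint32_t id = static_alloc_id(sizeof(T)); // resolved once per container instance

public:
    template <typename... Args>
    T *alloc(Args &&...p_args) {
        T *mem = (T *)static_alloc(id);
        memnew_placement(mem, T(std::forward<Args>(p_args)...)); // call the constructor
        return mem;
    }

    void free(T *p_mem) {
        p_mem->~T(); // call the destructor
        static_free(id, p_mem);
    }
};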
Performance Tests
Let's start by testing List<int>.
Master branch (malloc implementation), List<int>, 10 000 000 elements:
- Time insert: 236674 usec
- Time iteration: 39287 usec
- Time clear: 141672 usec
- Time total: 417736 usec
TypedStaticBlockAllocator:
- Time insert: 141104 usec
- Time iteration: 25416 usec
- Time clear: 76181 usec
- Time total: 242765 usec
Direct usage of BlockAllocator:
- Time insert: 96636 usec
- Time iteration: 26894 usec
- Time clear: 39474 usec
- Time total: 163112 usec
Direct usage of PagedAllocator:
- Time insert: 106321 usec
- Time iteration: 31373 usec
- Time clear: 59099 usec
- Time total: 196889 usec
Memory Usage List<int> for 100 000 000 elements:
- BlockAllocator: 2.95 GB
- PagedAllocator: 3.62 GB
- Malloc: 4.63 GB
So, as you can see, my allocator performs even better than PagedAllocator (although I expected it to be slower). The improvement in iteration time is worth noting in particular: it means working with the allocated data is much faster, since the layout is more cache friendly.
Code used:
#include <chrono>

int var = 0; // global sink so the iteration loop is not optimized away

template <typename T>
void test_list() {
    T list_standard;
    auto start = std::chrono::system_clock::now();
    {
        auto clock_start = std::chrono::system_clock::now();
        for (int i = 0; i < 10000000; i++) {
            list_standard.push_back(i);
        }
        auto clock_now = std::chrono::system_clock::now();
        // Durations are reported in microseconds (usec), matching the numbers above.
        auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        print_line("Time insert:", current_time);
    }
    {
        auto clock_start = std::chrono::system_clock::now();
        for (const int num : list_standard) {
            var = num;
        }
        auto clock_now = std::chrono::system_clock::now();
        auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        print_line("Time iteration:", current_time);
    }
    {
        auto clock_start = std::chrono::system_clock::now();
        list_standard.clear();
        auto clock_now = std::chrono::system_clock::now();
        auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(clock_now - clock_start).count());
        print_line("Time clear:", current_time);
    }
    auto end = std::chrono::system_clock::now();
    auto current_time = float(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    print_line("Time total:", current_time);
}

void main_benchmark() {
    test_list<List<int>>();
}
Dictionary test.
Dictionary[int, String]
Master branch:
- Time insert: 2655 msec
- Time iteration: 325 msec
- Time clear: 801 msec
- Time total: 5116 msec
Current PR:
- Time insert: 2472 msec
- Time iteration: 288 msec
- Time clear: 500 msec
- Time total: 3260 msec
Code Used:
func _ready() -> void:
    var start_test: int = Time.get_ticks_msec()
    var dict: Dictionary[int, String]
    if true:
        var start: int = Time.get_ticks_msec()
        for i in 10000000:
            dict[i] = " "
        var end: int = Time.get_ticks_msec()
        print("Time insert:", end - start)
    if true:
        var start: int = Time.get_ticks_msec()
        for n in dict.keys():
            pass
        var end: int = Time.get_ticks_msec()
        print("Time iteration:", end - start)
    if true:
        var start: int = Time.get_ticks_msec()
        dict.clear()
        var end: int = Time.get_ticks_msec()
        print("Time clear:", end - start)
    var end_test: int = Time.get_ticks_msec()
    print("Time total:", end_test - start_test)
Memory Usage Dictionary[int, String] for 100 000 000 elements:
- Master: 11.37 GB
- Current PR: 8.4 GB
Code used:
var dict: Dictionary[int, String]
for i in 100000000:
    dict[i] = " "
while true:
    pass # keep the process alive so memory usage can be read
Speed of adding children to a node.
Add 1000000 Nodes as child:
- Master: 3225423 usec
- Current PR: 2825762 usec
func _ready() -> void:
    var start: int = Time.get_ticks_usec()
    for i in 1000000:
        add_child(Node.new())
    var end: int = Time.get_ticks_usec()
    print(end - start)
Improved performance in some MRPs
MRP from #85871. Improves performance and reduces memory usage in the editor.
- Master: 3.73 GB
- Current PR: 2.97 GB
Memory usage can be checked via htop or task manager.
Until #92554 is merged, this PR improves animation performance by 5%.
Also, the MRP from #93568 loads 5% faster, and in general projects load 5-10% faster.
Improving mesh generation
I used this demo: https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel. The chart shows the average mesh generation time for a chunk before and after.
Better multithreading performance
See this comment: https://github.com/godotengine/godot/pull/97016#issuecomment-2356447368
Godot Benchmarks
- Master: master.json
- Current PR: multithread_allocator.json
Possible future PRs with the new BlockAllocator
- Using the new BlockAllocator in RID_Alloc. The plus side of this is that it is non-templated, meaning it won't generate new code for every new class that uses it, helping to reduce the size of the engine by a few megabytes as well as reducing CPU cache usage.
- Using it in ClassDB to make it lock-free and multi-thread safe.
I did the godot-benchmark test. Compile command:
scons CC=gcc-12 CXX=g++-12 scu_build=yes scu_limit=1024 lto=full linker=mold
~~master.json~~ ~~StaticBlockAllocator.json~~ Edit: these are old benchmarks; the new benchmarks are in the PR description.
Something I noticed is that RSA key generation has become 5 times slower according to these benchmarks. Might be worth looking at?
It looks like the RSA benchmark is not deterministic.
Master:
Current PR:
Anyway, I don't think these structures are used there or that they can affect its performance.
I did find a regression created by this PR. There was a significant degradation in multithreading because data from different threads using the same structure type ended up next to each other in memory, so the threads were effectively working only through the L3 cache.
Now I have not only fixed it but improved on it: separate allocators are used for each thread, so each thread has its own mutex, which greatly improved multithreaded performance.
So it can be tested with this code:
func _some_job(n):
    var dict: Dictionary
    var start: int = Time.get_ticks_usec()
    for i in 1000000:
        dict[i] = i
    var end: int = Time.get_ticks_usec()
    for res in dict:
        pass
    print("Time insert:", end - start)

func _ready() -> void:
    var start: int = Time.get_ticks_usec()
    var task_id := WorkerThreadPool.add_group_task(_some_job, 8, -1, true)
    WorkerThreadPool.wait_for_group_task_completion(task_id)
    var end: int = Time.get_ticks_usec()
    print("Full Time insert:", end - start)
Results Master:
Time insert:509230
Time insert:592604
Time insert:559632
Time insert:577218
Time insert:600717
Time insert:661088
Time insert:591868
Time insert:634930
Full Time insert:1361299
Before fixing:
Time insert:2104979
Time insert:2193414
Time insert:2261067
Time insert:2261572
Time insert:2256125
Time insert:2301050
Time insert:2307332
Time insert:2352035
Full Time insert:5498995
Now:
Time insert:335565
Time insert:338475
Time insert:374039
Time insert:483497
Time insert:484436
Time insert:485592
Time insert:493170
Time insert:490287
Full Time insert:790270
Edit: Actually the multithreading bottleneck is the atomic variable in the Memory class, not malloc: https://github.com/godotengine/godot/blob/a40fc2354a1639f283c7c46ba8dc321b7ff15960/core/os/memory.cpp#L112
nice
This PR is very interesting! I think there are 2 core ideas that are worth exploring:
- Using shared (Static) fixed-size allocators for HashMap, to reduce allocation overhead.
- Using freed memory in pools to store pointers to the next free allocation, to save on RAM use.
Note: I haven't tested this PR directly yet, since I wanted to get a feel for how it works and what it proposes first. The following are my findings from exploring the ideas in isolation.
Using shared fixed size allocators
To explore this idea in isolation, I implemented StaticPagedAllocator based on PagedAllocator (to avoid de-references):
(based on a quick PagedAllocator size-specific rewrite)
template <size_t N, bool thread_safe = false, uint32_t DEFAULT_PAGE_SIZE = 4096>
class PagedMAllocator {
    struct Slot {
        char data[N];
    };

    Slot **page_pool = nullptr;
    Slot ***available_pool = nullptr;
    uint32_t pages_allocated = 0;
    uint32_t allocs_available = 0;
    uint32_t page_shift = 0;
    uint32_t page_mask = 0;
    uint32_t page_size = 0;
    SpinLock spin_lock;

public:
    void *alloc() {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        if (unlikely(allocs_available == 0)) {
            uint32_t pages_used = pages_allocated;
            pages_allocated++;
            page_pool = (Slot **)memrealloc(page_pool, sizeof(Slot *) * pages_allocated);
            available_pool = (Slot ***)memrealloc(available_pool, sizeof(Slot **) * pages_allocated);
            page_pool[pages_used] = (Slot *)memalloc(sizeof(Slot) * page_size);
            available_pool[pages_used] = (Slot **)memalloc(sizeof(Slot *) * page_size);
            for (uint32_t i = 0; i < page_size; i++) {
                available_pool[0][i] = &page_pool[pages_used][i];
            }
            allocs_available += page_size;
        }
        allocs_available--;
        void *alloc = available_pool[allocs_available >> page_shift][allocs_available & page_mask];
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
        return alloc;
    }

    void free(void *p_mem) {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        available_pool[allocs_available >> page_shift][allocs_available & page_mask] = (Slot *)p_mem;
        allocs_available++;
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }

private:
    void _reset(bool p_allow_unfreed) {
        if (!p_allow_unfreed) {
            ERR_FAIL_COND(allocs_available < pages_allocated * page_size);
        }
        if (pages_allocated) {
            for (uint32_t i = 0; i < pages_allocated; i++) {
                memfree(page_pool[i]);
                memfree(available_pool[i]);
            }
            memfree(page_pool);
            memfree(available_pool);
            page_pool = nullptr;
            available_pool = nullptr;
            pages_allocated = 0;
            allocs_available = 0;
        }
    }

public:
    void reset(bool p_allow_unfreed = false) {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        _reset(p_allow_unfreed);
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }

    bool is_configured() const {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        bool result = page_size > 0;
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
        return result;
    }

    void configure(uint32_t p_page_size) {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        ERR_FAIL_COND(page_pool != nullptr); // Safety check.
        ERR_FAIL_COND(p_page_size == 0);
        page_size = nearest_power_of_2_templated(p_page_size);
        page_mask = page_size - 1;
        page_shift = get_shift_from_power_of_2(page_size);
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }

    // Power of 2 recommended because of alignment with OS page sizes.
    // Even if element is bigger, it's still a multiple and gets rounded to amount of pages.
    PagedMAllocator(uint32_t p_page_size = DEFAULT_PAGE_SIZE) {
        configure(p_page_size);
    }

    ~PagedMAllocator() {
        if constexpr (thread_safe) {
            spin_lock.lock();
        }
        bool leaked = allocs_available < pages_allocated * page_size;
        if (leaked) {
            if (CoreGlobals::leak_reporting_enabled) {
                ERR_PRINT(String("Pages in use exist at exit in PagedMAllocator: ") + String::num_int64(N));
            }
        } else {
            _reset(false);
        }
        if constexpr (thread_safe) {
            spin_lock.unlock();
        }
    }
};
template <typename T, bool thread_safe = false, uint32_t DEFAULT_PAGE_SIZE = 4096>
class PagedAllocator {
    PagedMAllocator<sizeof(T), thread_safe, DEFAULT_PAGE_SIZE> mallocator;

public:
    template <typename... Args>
    _FORCE_INLINE_ T *alloc(Args &&...p_args) {
        T *alloc = (T *)mallocator.alloc();
        memnew_placement(alloc, T(std::forward<Args>(p_args)...));
        return alloc;
    }

    _FORCE_INLINE_ void free(T *p_mem) {
        p_mem->~T();
        mallocator.free(p_mem);
    }

    _FORCE_INLINE_ bool is_configured() const { return mallocator.is_configured(); }
    _FORCE_INLINE_ void configure(uint32_t p_page_size) { mallocator.configure(p_page_size); }

    template <typename... Args>
    _FORCE_INLINE_ T *new_allocation(Args &&...p_args) { return alloc(std::forward<Args>(p_args)...); }
    _FORCE_INLINE_ void delete_allocation(T *p_mem) { free(p_mem); }

    _FORCE_INLINE_ void reset(bool p_allow_unfreed = false) {
        mallocator.reset(p_allow_unfreed || std::is_trivially_destructible_v<T>);
    }

    PagedAllocator(uint32_t p_page_size = DEFAULT_PAGE_SIZE) {
        configure(p_page_size);
    }

    _FORCE_INLINE_ ~PagedAllocator() {
        reset(false);
    }
};
#include "core/templates/paged_allocator.h"
template <size_t N, bool thread_safe>
class StaticPagedMAllocator {
static inline PagedMAllocator<N, thread_safe> mallocator;
public:
_FORCE_INLINE_ void *alloc() { return mallocator.alloc(); }
_FORCE_INLINE_ void free(void *p_mem) { mallocator.free(p_mem); }
};
template <typename T, bool thread_safe>
class StaticPagedAllocator {
public:
template <typename... Args>
_FORCE_INLINE_ T *alloc(Args &&...p_args) {
static StaticPagedMAllocator<sizeof(T), thread_safe> mallocator;
T *alloc = (T *)mallocator.alloc();
memnew_placement(alloc, T(std::forward<Args>(p_args)...));
return alloc;
}
_FORCE_INLINE_ void free(T *p_mem) {
static StaticPagedMAllocator<sizeof(T), thread_safe> mallocator;
p_mem->~T();
mallocator.free(p_mem);
}
template <typename... Args>
_FORCE_INLINE_ T *new_allocation(Args &&...p_args) { return alloc(std::forward<Args>(p_args)...); }
_FORCE_INLINE_ void delete_allocation(T *p_mem) { free(p_mem); }
};
I used this allocator in HashMap by default.
Unfortunately, while a very interesting idea, this made no practical performance difference (in either RAM or speed). The reason may be that macOS uses jemalloc as its allocator, which is already very good. But it may be worth exploring this idea again in the future, at least for specific use cases.
Use freed memory in pools to store pointers in
I explored this idea by patching PagedAllocator to use a free chain threaded through the freed values instead of an available_pool.
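For reference, a minimal sketch of that free-chain approach (illustrative only, not the exact patch measured below):

// Instead of a parallel available_pool of pointers, every freed slot stores the
// pointer to the next free slot, so the pool needs no extra bookkeeping memory.
template <size_t N>
class FreeChainPoolSketch {
    union Slot {
        Slot *next_free; // valid only while the slot is free
        char data[N];
    };

    LocalVector<Slot *> pages;
    uint32_t page_size = 4096;
    uint32_t top = 0; // next never-used slot in the last page
    Slot *free_chain = nullptr;

public:
    void *alloc() {
        if (free_chain) { // O(1) pop from the chain of freed slots
            Slot *slot = free_chain;
            free_chain = slot->next_free;
            return slot;
        }
        if (pages.is_empty() || top == page_size) {
            pages.push_back((Slot *)memalloc(sizeof(Slot) * page_size));
            top = 0;
        }
        return &pages[pages.size() - 1][top++];
    }

    void free(void *p_mem) { // O(1) push onto the chain, reusing the slot's own memory
        Slot *slot = (Slot *)p_mem;
        slot->next_free = free_chain;
        free_chain = slot;
    }

    ~FreeChainPoolSketch() {
        for (uint32_t i = 0; i < pages.size(); i++) {
            memfree(pages[i]);
        }
    }
};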
This decreased global RAM use of Godot by 10 MB (yay), but measurably regressed deallocation speed (3x slower). I've got to be honest, I don't understand why freeing regressed that much. Allocation also regressed, but only by ~5%.
All in all this idea is still very interesting to me, but again, jemalloc is very good already, and trying to one-up it can be difficult. I'm open to re-exploring this idea in the future though!
Using both at once
I tried using both patches at once, to see if RAM can be improved even more. There was a slight regression of RAM use over master, so I did not explore this idea further (unfortunate though it is!).
Takeaways
Both ideas are interesting to me, but I was unable to implement them in a way that measurably improves over master.
To reiterate, I haven't tested your code yet, because I wanted to understand the ideas being proposed here first. I feel confident now that I do, so I'll probably have a go at measuring your implementation soon.
If you're still open to maintaining this PR, I can definitely see it being useful to split it into two:
- Adding BlockAllocator as an alternative or replacement for PagedAllocator, which it is in direct competition with. Its performance could be tested in StringName, where it may have an interesting impact.
- Adding static / shared allocators (either PagedAllocator-based or BlockAllocator-based), and using them in List, HashMap and the likes.
Needs rebase