godot icon indicating copy to clipboard operation
godot copied to clipboard

[Core] Fix COW allocates too much memory

Open Nazarwadim opened this issue 1 year ago • 12 comments

I found a place where COW allocates too much memory for types whose size is not a multiple of 2. For example, we have a type size of 20 bytes and want to change the COW size to 4. To do this, call the resize(4) method. This method calls the _get_alloc_size method which allocates too much memory for us. _next_po2(20 * 4) = 128. But instead it should be _next_po2(4) * 20 = 80.

I haven't done performance benchmarks yet, but only looked at the amount of memory used by the project from CI https://github.com/godotengine/regression-test-project/archive/4.0.zip.

Before: image

After: image

Nazarwadim avatar Apr 28 '24 19:04 Nazarwadim

_next_po2(20 * 4) = 124

Did you mean 128?

I'm not sure about this, the power of two is expected to be on the allocated memory, otherwise it doesn't matter, it's relevant that the allocation is a power of two

It doesn't make any sense to me to allocate for a power of two number of elements, so either it should be a power of two bytes, or nothing at all

AThousandShips avatar Apr 28 '24 19:04 AThousandShips

There is a lot of memory that is simply not used. Like, why allocate 32 bytes when you can allocate 20. For 2 elements, it will be 48, although there is 40 real data there.

Nazarwadim avatar Apr 28 '24 20:04 Nazarwadim

Well it depends on what we're trying to achieve here, often allocating a power of two memory is desirable to the OS, as far as I know, I don't think the use of this particular formula here was in any way an oversight by the implementer but desired

AThousandShips avatar Apr 28 '24 20:04 AThousandShips

I checked the project, from issue. Godot didn't crash, but my computer did. I don't know if this is the right behavior.

Nazarwadim avatar Apr 28 '24 20:04 Nazarwadim

Did you try with the check restored?

AThousandShips avatar Apr 28 '24 20:04 AThousandShips

I opened the file godot_4_alpha_12_tileset_issue_2.log and in a few seconds the computer ran out of memory. The same thing happens on the master branch.

Nazarwadim avatar Apr 28 '24 20:04 Nazarwadim

The issue isn't fully solved for that side, but this check should probably not be removed, it's not unnecessary, so unless you can confirm it's unnecessary it should be restored IMO, the zero check can be removed but not the return *out; part

AThousandShips avatar Apr 28 '24 20:04 AThousandShips

If you change from return true; to return *out; and remove check, it won't work, because when p_elements = 0, false will be returned.

Nazarwadim avatar Apr 28 '24 21:04 Nazarwadim

And this zero check with return *out; is unnecessary due to this. ERR_FAIL_COND_V(!_get_alloc_size_checked(p_size, &alloc_size), ERR_OUT_OF_MEMORY); If we leave it as it was, then the compiler does two checks. The first one is whether x == 0. And the second one is this one, since it cannot know that *out will definitely be 0. But if I return true, the compiler will remove this check, which, by the way, reduces the size of the binary and improves performance a little. To be more precise, the size of a mono release before 77 878 576 and after 77 845 808.

I also took a closer look at your PR and noticed that before COW was 32-bit. This explains the overflow.

_next_po2 method body.

if (sizeof(USize) == 8) {
	x |= x >> 32;
}

This fixed the problem.

Nazarwadim avatar Apr 28 '24 23:04 Nazarwadim

Like, why allocate 32 bytes when you can allocate 20. For 2 elements, it will be 48, although there is 40 real data there.

See this explaination by Juan: https://youtu.be/BvpBEIe86CE?t=1630

timothyqiu avatar Apr 29 '24 04:04 timothyqiu

I just realized today that we don't allocate 2^n bytes. We allocate 2^n + 16 and + 16 if debug.

uint8_t *mem_new = (uint8_t *)Memory::alloc_static(alloc_size + DATA_OFFSET, false);

These are the expansion ranges before: 48 80 144 272 528 1040. After: 36 56 96 176 336 656.

This shows that the allocation is in no way similar to 2^n.

See this explaination by Juan: https://youtu.be/BvpBEIe86CE?t=1630

From the video, I understand the doubling of the size, not that it should be strictly 2^n.

And also in other implementations of the vector, for example, std::vector, or List from C#, the expansion occurs by multiplying capacity by 2.

Nazarwadim avatar Apr 29 '24 05:04 Nazarwadim

_next_po2(20 * 4) = 128. But instead it should be _next_po2(4) * 20 = 80.

What's the real memory usage difference, not the one shown by Godot? Underlying malloc will never allocate exact size, it's either picking up free slot from pre-allocated small size tables (usually few powers of 2) or allocating multiple of 4K pages.

bruvzg avatar Apr 29 '24 05:04 bruvzg

What's the real memory usage difference.

My 173.3 MB Master 173.5 MB.

It looks like a much smaller difference.

I'm thinking about it now. For example, we have 1024 bytes of data and another 16 bytes of metadata(ref_couter + size). And as I understand it, malloc will allocate 2048 bytes. And when we need to expand the size to 2048, then malloc will allocate 4096. Although we only need 2048. My idea is to adjust the size of alloc_size + DATA_OFFSET to 2^n. Most likely, this will require the following formula next_po2(p_elements * sizeof(T)) - DATA_OFFSET. I would like to hear your thoughts on this.

Nazarwadim avatar Apr 29 '24 12:04 Nazarwadim

This makes sense to take metadata into account.

For small allocs it actually might have more options than powers of 2, like https://git.musl-libc.org/cgit/musl/tree/src/malloc/mallocng/malloc.c#n12 but it's platform/CRT dependent, so I would not relay on any specific values.

bruvzg avatar Apr 29 '24 12:04 bruvzg

Then do I need to do it like this so that both small and large data types take up less space?

next_po2(p_elements) * sizeof(T) - DATA_OFFSET

Nazarwadim avatar Apr 29 '24 12:04 Nazarwadim