godot
godot copied to clipboard
[3.x] Revamp PoolVector
Modify PoolVector internals to use LocalVector internally, and remove the hard limit on the number of PoolVectors.
While in master, PoolVector has been replaced wholesale with Vector, in order to minimize disruption to existing codebase in 3.x, I have been testing making PoolVector a simple wrapper around LocalVector
giving it COW and refcounting and compatibility with the existing PoolVector
interface.
Fixes #37154 Fixes #53665 Alternative to #54263
Update
When using DEV_ENABLED
builds it will now throw a warning when contention is detected. This is a situation which could result in undefined behaviour when compiling without GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC
.
There are two separate methods for detecting contention depending on whether the PR is compiled using single global lock
or lock per Alloc
.
Thread safety
Although I originally started this PR in order to simply address the hard limit on the number of allocations, I've also noticed that there is a potential thread safety issue, which could lead to data corruption.
To be clear the original implementation uses a mutex for the copy_on_write
and resize
, which are the most important points to cover (because the data address may change), however it doesn't currently lock the Alloc
for reads and writes.
Simple or pedantic locking
In order to cover more situations and detect whether this was a problem, this PR features two ways of handling thread safety:
- Single global mutex used on resize and COW, mirroring the previous approach
- Mutex per
Alloc
, which gives fine grained locking, and locks on all operations, including reserving aRead
or aWrite
. (we could alternatively use a simpler locking primitive)
Both will print warnings (once to avoid spamming) in DEV_ENABLED builds where it detects contention - situations where more than one thread is attempting to read and write the same data at the same time. This has high potential to be a bug. It turns out that this situation is occurring with high frequency when the renderer is set to "multithreaded" mode (not the default), especially when loading scenes.
@jordo has pointed out that to some extent we could say the object T
should be held responsible for thread safety within set
and get
operations, and this may apply for complex types. However many of the PoolVector
s are used with simple types such as Vector3
so there is the possibility for mangled results, or more likely, reading an array half in one state and half in another.
Should COW protect against multithread access?
To some extent, it seems that COW should be in a position to provide some protection against multiple threads using the same Alloc
. Providing a PoolVector
is copied, rather than used via a pointer or by reference, then the new PoolVector
should copy the Alloc
on write, ensuring that the thread reading will read the same historical data.
This is probably working in a lot of situations, but not all. There appear to be some situations where the contention is happening in practice. There also appear to be a number of cases which may also be susceptible, such as passing data by const reference PoolVector
s from the scene tree to the servers.
How to solve
It is difficult to know what the best way to "solve" this problem is. For now we are thinking of having version (1) active (i.e. the same behaviour as before), and see if we can detect contention in DEV_ENABLED builds and flag a warning, without affecting behaviour.
We could alternatively go with (2), however there are additional costs in terms of memory use (at the least 80 bytes per mutex, and possibly more), and potentially in terms of performance (a lot more locking - this hasn't seemed too noticable but I haven't benchmarked), and construction / destruction of the mutexes.
With (2) the paradigm becomes only one threads Read
or Write
can have access to the Alloc at a time (a single thread can do a double lock, as the mutex is recursive). This could lead to some coarse grained waiting if there is contention across threads, but it should be safe, and locking per Read
/ Write
is potentially a lot more efficient than e.g. locking per element.
Notes
- Still a WIP, particularly we need to decide on the locking method.
- This version unlike the original
PoolVector
allocates no memory up front, and has no hard limits to the number ofPoolVector
s. This means memory RAM use is typically lower than before, and is probably a better approach to the limit than increasing it and increasing the fixed RAM cost. - I first tested this wrapping
Vector
, however this duplicate the COW functionality inPoolVector
, so I'm now usingLocalVector
, which has no COW functionality, and it seems to be working fine so far. - Any locking / multithread safety is handled by
PoolVector
andPoolVector::Access
, asLocalVector
is not thread safe. - Debug memory counts are now implemented as before, using a mutex (there is the possibility to use atomics / SafeNumerics directly for these counters, and I have had this working, but would rather leave to a 2nd PR as the benefits for performance are not yet clear).
- If we are happy with this we may decide later to put the
PoolVector_Alloc
s into one of theCommonPools
, using e.g.PagedAllocator
, rather than make the dynamic allocations using malloc (although of course the vector allocations themselves for resizing will still use malloc).
@lawnjelly @reduz can we setup a discussion on contributors chat for this stuff? I feel like I'm really lost with whats going on with this stuff as of late, and I'm having a hard time understanding what changes are being made, what the problems are, what are the design goals, etc. I'm not opposed to any of this, and welcome all and any efforts to improve (there definitely needs to be improvement), but the changes are coming in hot and fast with not much formal discussion or technical proposals of what exactly the issues are, how any PR and/or refactoring directly addresses them, and benchmarks to examine .
The core fundamentals & data structures of the engine are in dire need of additional analysis & scrutiny, unit testing & benchmarks, and technical documentation that can be understood a group of contributors. This is a key area that I have, over the past year, spent too many hours modifying and working around to meet performance requirements of a game where we needed to be performance first.
-
- Can we define technically, what the design goals of
PoolAllocator
were, and how it accomplished them? Because this PR removes it completely. From what I can tell on a first glance, it's a pre-allocated block, which is/was used as a generic memorypool
with a limited amount of compaction and resizing capabilities perEntry
.
- Can we define technically, what the design goals of
-
- Since this PR maintains the data structure called
PoolVector
, in 3.X we haveVector
,LocalVector
, andPoolVector
(not that we don't already have them already). But what exactly is PoolVector, and what does it do? Is it documented anywhere? Is it effectively now just a reference counted, COW,LocalVector
? Is it even supposed to be thread safe, or not? Does it have stable element references on iteration? etc.
- Since this PR maintains the data structure called
The second comment above is more a general thought about the core structures of the engine, but it illustrates a point that none of this is really well communicated or formally documented anywhere.
As a contrast, we've moved almost all of our performance critical Dictionary & Hashmap operations to https://github.com/martinus/robin-hood-hashing (either it's flat_map, or node based memory backing layouts), and seen absolutely ridiculous performance improvements across the board and almost everywhere. This single data structure's test dir has an incredible number of unit tests and benchmarks (https://github.com/martinus/robin-hood-hashing/tree/master/src/test/unit). I just think it's a great example, that the foundational building blocks of a project as large as godot, especially data structures, need to have proper documentation, tests, and benchmark suites. Otherwise its very difficult to understand what's changing, what the goals are, and try to objectively determine what results in the best outcome performance wise.
With regards to the direct questions and comments that @lawnjelly brought up in this PR, there's a lot. Would we be able to schedule a contributors discussion and perhaps even post a more formal write up with what the problems are (other than just linked issue #'s), and what we are changing and why? I think that would be really helpful.
@lawnjelly @reduz can we setup a discussion on contributors chat for this stuff? I feel like I'm really lost with whats going on with this stuff as of late, and I'm having a hard time understanding what changes are being made, what the problems are, what are the design goals, etc.
@jordo I'm in broad agreement with everything you've said there, and first of all "don't panic", everything especially in core tends to go through an extensive period of scrutiny and peer review before we consider merging anything - no one gets to unilaterally make changes on their own, even reduz PRs are open for review.
There should be plenty of time and interest for your input, and it is always encouraged as you are well versed in this area. Indeed as I said on rocket chat the other day you are very encouraged to make PRs or draft PRs / proposals (often we have different ways of approaching a problem, as here we have a linked PR by @Calinou which spurred me to have a look at the problem and experiment with this alternative approach), I think you said you'd be able to do more after a deadline at work.
A meeting would be great actually to discuss some of these core aspects in 3.x and 4.x. Makes it far easier to pin down Juan and get his input as he is usually busy with lots of things. :grinning:
Irrespective of meetings / short discussions on rocket chat, often we do the majority of discussion in comments on PRs too, as this works no matter what time zone you are in etc.
The core fundamentals & data structures of the engine are in dire need of additional analysis & scrutiny, unit testing & benchmarks, and technical documentation that can be understood a group of contributors. This is a key area that I have, over the past year, spent too many hours modifying and working around to meet performance requirements of a game where we needed to be performance first.
Agreed.
Can we define technically, what the design goals of PoolAllocator were, and how it accomplished them? Because this PR removes it completely. From what I can tell on a first glance, it's a pre-allocated block, which is/was used as a generic memory pool with a limited amount of compaction and resizing capabilities per Entry.
It was dead code, as far as I could see. Looked to be used historically from PoolVector
, but not currently. So the policy is normally to remove dead code where possible (it remains in the git history so can be salvaged). Juan mentions in the PR that removes PoolVector
in 4.x, that a lot of this may be historical stuff from when machines had very limited memory.
Since this PR maintains the data structure called PoolVector, in 3.X we have Vector, LocalVector, and PoolVector (not that we don't already have them already). But what exactly is PoolVector, and what does it do? Is it documented anywhere? Is it effectively now just a reference counted, COW, LocalVector? Is it even supposed to be thread safe, or not? Does it have stable element references on iteration? etc.
Yes again, this is lack of documentation, this should imo be at the top of the header file a description of what the template is, what it is meant to do etc. By reverse engineering I worked out it is meant to be a COW reference-counted vector. The multithread paradigm was totally unclear, but empirical testing for simultaneous access suggests to me it is being used from multiple threads, and it does not look currently thread safe to do so. Whether that constitutes a problem in the client code, or a problem in the template itself, I cannot say.
I just think it's a great example, that the foundational building blocks of a project as large as godot, especially data structures, need to have proper documentation, tests, and benchmark suites. Otherwise its very difficult to understand what's changing, what the goals are, and try to objectively determine what results in the best outcome performance wise.
Do agree, but to be pragmatic also - large engines like Godot can end up being the result of a large amount of historical stuff. Hence there is a lot of stuff using linked lists which should really be using vectors, vectors which could be local vectors etc etc. This was part of the impetus for the large changes in 4.x I believe.
Especially in 3.x (which this PR is aimed at), it is a rather special situation - we have to work with what we have, there isn't a huge scope for change (as we need backward compatibility, and we don't want to rewrite the rest of the codebase).
So in 3.x the limited core changes that we might make are likely to be:
- Those that are necessary to fix bugs
- Those that can relatively simply make current systems run more efficiently
Thus for 3.x there seem to be two bugs in the existing PoolVector
, which would be nice to close:
- Hard limit of 65535 (and associated fixed memory use scaling with this limit).
- Possibly multithreading bug, either in use or within the template or both.
Just to note briefly discussed this this morning:
lawnjelly 9:43 AM reduz I've been having a bit of a look at PoolVector in 3.x. I realised you have changed it mostly to use Vector in 4.x, so I have been experimenting with changing the internals of PoolVector in 3.x to make it a wrapper but still maintain the same API, primarily in order to get around the limit of 65535 allocations. This has been going really well, and I have a draft PR: #59444
However what has foxed me is that while I can see PoolVector should be reference counted and COW, I'm unsure of whether it should be supporting multithreading. There is some effort at locks in the existing code, but it doesn't provide very much safety. And when I put some debugging in, it seems that in some builds of the engine (especially multithreaded renderer), and in some situations (loading?) PoolVectors do seem to be being used simultaneously from multiple threads. So what I'm wondering is, in your view, would it be better to tighten up PoolVector to make it as thread safe as possible, or to examine the calling code and try and prevent these multithread accesses?
reduz 📛 reduz 📛 9:49 AM For Godot 3 I would not worry a lot, you can just use Vector behind the back
lawnjelly lawnjelly 9:49 AM I must admit I'm slightly concerned that other than read situations, there is a great potential for things to go horribly wrong with multithread use.
reduz 📛 reduz 📛 9:49 AM i dont think this is used in multithreading much in G3
So this confirms that at least Juan isn't aware of any pre-existing scheme to handle multithread access here, and we should try and address it as best we can (given the practicalities and the desire to not change too much).
I will try and investigate the multithread access a bit more. There is always the option to do nothing to improve multithreading over the old implementation, but if there is some low hanging fruit here in terms of improvements it may be worth taking advantage (providing it doesn't compromise performance - some locking can be essentially free if it is rarely contended).
@lawnjelly can you provide a example where multithreaded access and/or write causes invalid access? This could also be checked empirically with tsan and a heavy-load test suite. But what critical sections do you want to lock/unlock on? There already looks to be many places in CowData<T> that look like it attempts to be thread-safe, but it's absolutely not. So the consideration of thread safety at the higher PoolVector layer maybe should just be punted on until some of the foundational structures are tightened up? Example:
https://github.com/godotengine/godot/blob/3.x/core/cowdata.h#L227-L228
This above is a good example of thread safety done incorrectly. _copy_on_write()
is supposed to be thread safe as it trys to use a SafeNumeric, but you can obviously see in the above snippet, if execution is preempted after executing line 227 (loading rc to register), and another thread just so happens to increment the refcount from 1 to > 1 in between those lines, we end up with issues and problems.
The fact that this works most of the time and doesn't crash and do unexpected things randomly I'm guessing is just due to the fact that these critical sections aren't hit or hammered hard enough with multi-threaded load. Perhaps not many developers or projects using these data structures in such environments.
Another example in cowdata.h:
https://github.com/godotengine/godot/blob/3.x/core/cowdata.h#L200
The above leads to a double-free bug/exploit. Line 200 is not atomic, as the decrement + conditional evaluation is not an atomic operattion despite being a single line. This leads to double-free bug.
Just to note we discussed this yesterday on rocketchat:
This PR is using LocalVector
rather than Vector
. LocalVector
has no COW or refcounting, and PoolVector uses a different COW solution. @jordo 's comments about CowData
don't apply here (although is still relevant for Vector
of course), although the COW solution in this PR does need to be independently checked for thread safety.
Contention
Today I'll be trying a few projects and trying to ascertain an idea of how often contention happens in PoolVector
, using a mutex per Alloc. I've done this by uncommenting the section in the PR in Alloc::lock()
, where it can optionally call try_lock
. If the try_lock
fails, I print a WARNING and hit a breakpoint.
I try to use some real game projects rather than demos where possible, as these can better represent real world use cases. These are often closed source.
So far it seems that contention seems more a problem when multithread rendering is selected in the project setting, rather than the default single-safe. It also doesn't seem to be occurring so far during using the editor, but does during loading the games.
3D Platformer Demo
No evidence of contention.
Beat Invaders (@RPicster )
No contention in single-safe renderer. With multithread renderer there is a lot of contention. Not knowing the inner workings of the game I don't know whether there is some threading in the script, but I will show some stack traces here.
This seems to all be occurring in the initial loading phase of the game, when playing the game itself there is no contention so far.
texture_set_data
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Access::_ref pool_vector.h 304 0x2bbc9e7
3 PoolVector<unsigned char>::read pool_vector.h 391 0x2bba7ad
4 RasterizerStorageGLES3::texture_set_data rasterizer_storage_gles3.cpp 756 0x40f3116
5 VisualServerRaster::texture_set_data visual_server_raster.h 155 0x5ffbad2
6 CommandQueueMT::Command3<VisualServer, void (VisualServer:: *)(RID, Ref<Image> const&, int), RID, Ref<Image>, int>::call command_queue_mt.h 300 0x6086605
7 CommandQueueMT::flush_one command_queue_mt.h 441 0x5f7487a
8 CommandQueueMT::wait_and_flush_one command_queue_mt.h 478 0x5f6d287
9 VisualServerWrapMT::thread_loop visual_server_wrap_mt.cpp 78 0x6069bc8
10 VisualServerWrapMT::_thread_callback visual_server_wrap_mt.cpp 64 0x6069aed
11 Thread::callback thread.cpp 78 0x63ccbe7
12 std::__invoke_impl<void, void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *> invoke.h 60 0x63cdbea
13 std::__invoke<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *> invoke.h 95 0x63cda61
14 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul> thread 264 0x63cd9e0
15 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::operator() thread 271 0x63cd945
16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>>::_M_run thread 215 0x63cd59e
17 execute_native_thread_routine 0x68970c4
18 start_thread pthread_create.c 477 0x7ffff7b74609
19 clone clone.S 95 0x7ffff7944163
regenerate_word_cache
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Alloc::super_lock pool_vector.h 127 0x2c0da67
3 PoolVector<unsigned char>::_copy_on_write pool_vector.h 194 0x2c0d938
4 PoolVector<unsigned char>::write pool_vector.h 398 0x2c0db56
5 DynamicFontAtSize::_bitmap_to_character dynamic_font.cpp 534 0x59377f3
6 DynamicFontAtSize::_update_char dynamic_font.cpp 676 0x593ef21
7 DynamicFontAtSize::get_char_size dynamic_font.cpp 289 0x5936ccd
8 DynamicFont::get_char_size dynamic_font.cpp 908 0x5939b7f
9 Label::regenerate_word_cache label.cpp 468 0x5222777
10 Label::get_minimum_size label.cpp 292 0x5222d78
11 Control::_update_minimum_size_cache control.cpp 184 0x5159c63
12 Control::get_combined_minimum_size control.cpp 203 0x5159d7a
13 Control::_size_changed control.cpp 1194 0x5158dc5
14 Control::_notification control.cpp 481 0x515e87f
15 Control::_notificationv control.h 47 0x2c2462b
16 Label::_notificationv label.h 37 0x510441f
17 Object::notification object.cpp 927 0x624c35b
18 Node::_propagate_ready node.cpp 175 0x5079625
19 Node::_propagate_ready node.cpp 171 0x50795f2
20 Node::_propagate_ready node.cpp 171 0x50795f2
... <More>
draw_char
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Alloc::super_lock pool_vector.h 127 0x2c0da67
3 PoolVector<unsigned char>::_copy_on_write pool_vector.h 194 0x2c0d938
4 PoolVector<unsigned char>::write pool_vector.h 398 0x2c0db56
5 DynamicFontAtSize::_bitmap_to_character dynamic_font.cpp 534 0x59377f3
6 DynamicFontAtSize::_make_outline_char dynamic_font.cpp 626 0x5938abe
7 DynamicFontAtSize::_update_char dynamic_font.cpp 672 0x593edea
8 DynamicFontAtSize::draw_char dynamic_font.cpp 352 0x5937063
9 DynamicFont::draw_char dynamic_font.cpp 957 0x593a003
10 FontDrawer::draw_char font.h 94 0x5212444
11 Label::_notification label.cpp 266 0x5221a40
12 Label::_notificationv label.h 37 0x510447b
13 Object::notification object.cpp 927 0x624c35b
14 CanvasItem::_update_callback canvas_item.cpp 455 0x5653908
15 MethodBind0::call method_bind.gen.inc 59 0x2e1702c
16 Object::call object.cpp 918 0x624fa9c
17 MessageQueue::_call_function message_queue.cpp 241 0x6243849
18 MessageQueue::flush message_queue.cpp 284 0x6243b4f
19 SceneTree::iteration scene_tree.cpp 563 0x50b3dd2
20 Main::iteration main.cpp 2242 0x2bfd33f
... <More>
_update_minimum_size_cache
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Alloc::super_lock pool_vector.h 127 0x2c0da67
3 PoolVector<unsigned char>::_copy_on_write pool_vector.h 194 0x2c0d938
4 PoolVector<unsigned char>::write pool_vector.h 398 0x2c0db56
5 DynamicFontAtSize::_bitmap_to_character dynamic_font.cpp 534 0x59377f3
6 DynamicFontAtSize::_update_char dynamic_font.cpp 676 0x593ef21
7 DynamicFontAtSize::get_char_size dynamic_font.cpp 289 0x5936ccd
8 DynamicFont::get_char_size dynamic_font.cpp 908 0x5939b7f
9 Label::get_longest_line_width label.cpp 325 0x522309d
10 Label::regenerate_word_cache label.cpp 377 0x5221e0f
11 Label::get_minimum_size label.cpp 292 0x5222d78
12 Control::_update_minimum_size_cache control.cpp 184 0x5159c63
13 Control::get_combined_minimum_size control.cpp 203 0x5159d7a
14 Control::_size_changed control.cpp 1194 0x5158dc5
15 Control::_notification control.cpp 481 0x515e87f
16 Control::_notificationv control.h 47 0x2c2462b
17 Label::_notificationv label.h 37 0x510441f
18 Object::notification object.cpp 927 0x624c35b
19 Node::_propagate_ready node.cpp 175 0x5079625
20 Node::_propagate_ready node.cpp 171 0x50795f2
... <More>
get_char_size
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Alloc::super_lock pool_vector.h 127 0x2c0da67
3 PoolVector<unsigned char>::_copy_on_write pool_vector.h 194 0x2c0d938
4 PoolVector<unsigned char>::write pool_vector.h 398 0x2c0db56
5 DynamicFontAtSize::_bitmap_to_character dynamic_font.cpp 534 0x59377f3
6 DynamicFontAtSize::_update_char dynamic_font.cpp 676 0x593ef21
7 DynamicFontAtSize::get_char_size dynamic_font.cpp 289 0x5936ccd
8 DynamicFont::get_char_size dynamic_font.cpp 908 0x5939b7f
9 Font::get_string_size font.cpp 477 0x596f618
10 Button::get_minimum_size button.cpp 37 0x512f6e0
11 Control::_update_minimum_size_cache control.cpp 184 0x5159c63
12 Control::get_combined_minimum_size control.cpp 203 0x5159d7a
13 Control::_size_changed control.cpp 1194 0x5158dc5
14 Control::_notification control.cpp 481 0x515e87f
15 Control::_notificationv control.h 47 0x2c2462b
16 BaseButton::_notificationv base_button.h 39 0x4687b0f
17 Button::_notificationv button.h 37 0x46879ff
18 Object::notification object.cpp 927 0x624c35b
19 Node::_propagate_ready node.cpp 175 0x5079625
20 Node::_propagate_ready node.cpp 171 0x50795f2
... <More>
Blubits Demo
No contention in single threaded renderer. Multithreaded renderer, again during loading the scene:
texture_set_data
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Access::_ref pool_vector.h 304 0x2bbc9e7
3 PoolVector<unsigned char>::read pool_vector.h 391 0x2bba7ad
4 RasterizerStorageGLES3::texture_set_data rasterizer_storage_gles3.cpp 756 0x40f3116
5 VisualServerRaster::texture_set_data visual_server_raster.h 155 0x5ffbad2
6 CommandQueueMT::Command3<VisualServer, void (VisualServer:: *)(RID, Ref<Image> const&, int), RID, Ref<Image>, int>::call command_queue_mt.h 300 0x6086605
7 CommandQueueMT::flush_one command_queue_mt.h 441 0x5f7487a
8 CommandQueueMT::wait_and_flush_one command_queue_mt.h 478 0x5f6d287
9 VisualServerWrapMT::thread_loop visual_server_wrap_mt.cpp 78 0x6069bc8
10 VisualServerWrapMT::_thread_callback visual_server_wrap_mt.cpp 64 0x6069aed
11 Thread::callback thread.cpp 78 0x63ccbe7
12 std::__invoke_impl<void, void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *> invoke.h 60 0x63cdbea
13 std::__invoke<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *> invoke.h 95 0x63cda61
14 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul> thread 264 0x63cd9e0
15 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::operator() thread 271 0x63cd945
16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>>::_M_run thread 215 0x63cd59e
17 execute_native_thread_routine 0x68970c4
18 start_thread pthread_create.c 477 0x7ffff7b74609
19 clone clone.S 95 0x7ffff7944163
regenerate_word_cache
1 PoolVector<unsigned char>::Alloc::lock pool_vector.h 145 0x2bbca60
2 PoolVector<unsigned char>::Alloc::super_lock pool_vector.h 127 0x2c0da67
3 PoolVector<unsigned char>::_copy_on_write pool_vector.h 194 0x2c0d938
4 PoolVector<unsigned char>::write pool_vector.h 398 0x2c0db56
5 DynamicFontAtSize::_bitmap_to_character dynamic_font.cpp 534 0x59377f3
6 DynamicFontAtSize::_update_char dynamic_font.cpp 676 0x593ef21
7 DynamicFontAtSize::get_char_size dynamic_font.cpp 289 0x5936ccd
8 DynamicFont::get_char_size dynamic_font.cpp 908 0x5939b7f
9 Label::get_longest_line_width label.cpp 325 0x522309d
10 Label::regenerate_word_cache label.cpp 377 0x5221e0f
11 Label::get_total_character_count label.cpp 631 0x522388a
12 Label::set_percent_visible label.cpp 599 0x5223a58
13 MethodBind1<float>::call method_bind.gen.inc 775 0x2e18d42
14 ClassDB::set_property class_db.cpp 1007 0x6193eda
15 Object::set object.cpp 422 0x624c49f
16 SceneState::instance packed_scene.cpp 272 0x5a3498f
17 PackedScene::instance packed_scene.cpp 1588 0x5a35a57
18 SceneState::instance packed_scene.cpp 163 0x5a336e8
19 PackedScene::instance packed_scene.cpp 1588 0x5a35a57
20 SceneTree::change_scene_to scene_tree.cpp 1366 0x50b8fbf
... <More>
Jetpaca
No contention in single-safe renderer.
With multithread renderer:
texture_set_data
Discussion
The contention only seems to be coming from a few places, perhaps this could be used to our advantage. Or maybe we could restrict locking to only happen in multithread renderer mode, as it only seems to be occurring there. It is possible that we haven't had issues with this before simply because few people are using the multithreaded renderer option (I'm not sure it offers significant benefits in terms of performance anyway).
We could also enable some debug logging to flag such contention in DEV_ENABLED
builds... :thinking:
Some performance timings of different release compiles of PoolVector
, in milliseconds, smaller times is therefore faster.
These are in gdscript so not super gold standard, because there will be a large fixed cost due to the scripting, but easier to rig up, and give us a ballpark idea of whether various approaches are faster or slower.
Here's the project: PoolVectorBenchmark.zip
Timings made on Intel Core i5 7500T 2.7ghz x 4, Linux Mint 20.
1024x1024 push backs, 10 runs (smaller is faster)
Old PoolVector : 1899, 1826, 1872, 1808 This PR, global lock : 1431, 1393, 1402, 1378 This PR, lock per alloc : 1579, 1583, 1639, 1593
1024x1024 sets, 10 runs (smaller is faster)
Old PoolVector : 1116, 1099, 1072, 1078 This PR, global lock : 1107, 1057, 1053, 1065 This PR, lock per alloc : 1173, 1185, 1239, 1198
Results
push_back
is very inefficient, and stresses the resize machinery, and requires a lock in both the old and new PoolVector
implementations. set
is lock free except in the case of the "lock per alloc" version.
The global lock approach was expected to be roughly similar to the old PoolVector
. As it happens, it turns out to be a bit faster for the push_back
s, and not much difference for the set
s.
The "lock per alloc" approach was expected to be slower during the set
s as this stresses the locking. This was the case and it is a smidgen slower than before, in exchange for the safety. For the push_back
s, it is faster than the old PoolVector
, but not by as much as the global lock version.
Discussion
The main thing is that this confirms that the PR with global lock is not slower than the old implementation, and together with fixing the 65535 limitation this makes a fairly compelling case for introducing it.
The lock per alloc turns out to not be prohibitively expensive either, so we can evaluate at a later date either moving to this version generally, or having this version run when the engine is in multithreaded mode.
I guess an over-simplication could just be, is PoolVector thread-safe or not? It seems to attempt to be thread-safe as
PoolVector<T>::operator[]
uses 'read' access,PoolVector<T>::set
uses a 'write' access, etc. If that is indeed the case, it would just be so much simpler to put a mutex on the object, and a std::lock_guard at the start of each api call that PoolVector exposes, which is pretty thin tbh. And most of the interface into actually using a PoolVector is through read & write access, which is partially doing that anyways right now.
Note that the actual data for PoolVector
is not stored in the PoolVector
(or owned directly), it is stored in the PoolVector::Alloc
, which is a reference counted object.
We could put a lock on PoolVector
functions, but it wouldn't help, because multiple PoolVector
s can be accessing the same PoolVector::Alloc
simultaneously. Thus it is the Alloc
that needs to be the thing that is locked, either with the global lock or lock per alloc.
The Alloc
can be optionally locked in this PR on all accesses (which I think is what you are getting at), that is what the GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC
define is intended to offer.
The Alloc
is already locked on all accesses, if you take a look at Access
, which is inherited by Read
and Write
in pool_vector.h
. What that basically means is that when there is either a read or write access object, the calling thread will straight up try and lock access anyways. So the thread safety of the class is totally bonkers right now and already quite half baked.
I put up a simple draft PR with here if of any interest. It probalby won't go anywhere,but it's quite clear and concise what interface does what wrt thread safety:
https://github.com/godotengine/godot/pull/59903
Also made a few quick comments on rocket chat as well. But i think the class, as is right now, is unnecessarily complicated (as sometimes evidenced by amount of comments).
I don't really have a strong preference though about merging this PR however, I just hope at some point the godot project can start to focus on refactoring, unit_testing the container classes, improving clarity, & cleanup etc. These classes shouldn't be that hard to follow. But I can't use the class in a contentious multi-threaded environment anyways, as currently too much potential for errors.
The
Alloc
is already locked on all accesses, if you take a look atAccess
, which is inherited byRead
andWrite
inpool_vector.h
. What that basically means is that when there is either a read or write access object, the calling thread will straight up try and lock access anyways. So the thread safety of the class is totally bonkers right now and already quite half baked.
You mean in this PR with the lock per alloc? This is the same as your PR does I think #59903 .
I've just seen that on closer examination the original does increment a lock SafeNumeric
:
https://github.com/godotengine/godot/blob/a4bbc87acc7a22f273452301da2d70ebf0e65914/core/pool_vector.h#L257-L268
Which I thought originally had no effects, but it turns out it is used to abort a resize if set: https://github.com/godotengine/godot/blob/a4bbc87acc7a22f273452301da2d70ebf0e65914/core/pool_vector.h#L527-L529
This probably means we should either go for the fine grained lock, or put a similar SafeNumeric
in coarse grained mode to emulate this behaviour (in which case it wouldn't be that much more expensive than to just use the mutex and be done with it). I'm pretty sure things would go horribly wrong shortly after a resize gets aborted.
Anyway I'm putting in an equivalent SafeNumeric
lock for the coarse grained mode and we'll see where we are, with this new info I'm inclined to also move towards the plan of just using a mutex per Alloc.
I put up a simple draft PR with here if of any interest. It probably won't go anywhere,but it's quite clear and concise what interface does what wrt thread safety:
I've been having a look, it seems to mostly do the same thing, it's more like this PR with GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC
set. When we spoke yesterday you seemed against locking for access, and preferred just for COW and resize and major stuff (which is what this PR does without GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC
defined), but you seem to have changed opinion since then.
But i think the class, as is right now, is unnecessarily complicated (as sometimes evidenced by amount of comments).
I do totally understand especially as it is currently covering two cases in one. If we decide in e.g. a PR meeting to go with one approach we can easily remove the alternative path. It may be once we have decided on a method e.g. the mutex per alloc that this looks very similar to your version. I may have a look at splitting this up into two PRs as I do get the point it is rather hard to follow.
It also may be the PR covers some special cases which may not be dealt with in your current version (but maybe the same is true vice versa, I am copying some stuff lol), as I remember there being at least a couple of "gotchas" to watch out for. Changing to locked guards is a good option, we use this technique in some other parts of Godot, I don't know if there is currently a standard form for this. I'm really following the lead of the original PoolVector
here using explicit locks.
I don't really have a strong preference though about merging this PR however, I just hope at some point the godot project can start to focus on refactoring, unit_testing the container classes, improving clarity, & cleanup etc.
Yes this is the aim for 4.x, for 3.x we are aiming not to change too much - the vast majority of effort is going into 4 at this point.
But I can't use the class in a contentious multi-threaded environment anyways, as currently too much potential for errors.
Absolutely, this is more of a band-aid - I can also see lots of potential failure points. But at this point we can't afford to rewrite 3.x to fit a new paradigm. The interface to PoolVector
ideally has to stay the same, we have to weigh up the cost of introducing a whole host of new bugs if we rewrote client code "properly". It isn't ideal, but that is the situation we have to deal with I guess.
UPDATE: I have decided splitting up into two PRs will be better, one for lock per alloc, and one more similar to the old version. Personally I think the slight performance cost is worth the extra safety of using a mutex per Alloc, and if necessary the calling code can be tidied up in bottlenecks. Version is still very much WIP though, there's probably some unsafe stuff in there as I've rejigged a fair bit, and I will be away tomorrow so will do a bit more later this week. I will borrow liberally from yours @jordo I'm sure as you are considerably more well versed in this area than myself.
If you are going to use some of that draft https://github.com/godotengine/godot/pull/59903 I can make a few quick comments to help you through my thoughts there.
To answer a question you asked me on rocket chat about why I removed the reference counter from being a SafeNumeric
to a plain uint32_t, it was intentional. SafeNumeric
modifications are atomic across threads (increment, drecement, etc), but modification, followed by a load and conditional branch are not thread safe. Preemption can happen after the load and before the conditional JNE or whatever branch, depending on compiler which may makes stuff like below not thread safe
if (alloc->refcount.get() == 1) {
return;
I see currently you lock around the cow code above, but it's just an example that sometimes using SafeNumeric
gives a false sense of thread safety, when the only thing it guarantees is an atomic modifier and nothing more really should be assumed, unless you specifically operate with load modes and start getting into things like atomic compare-and-swap loops. But the simplest view is to treat the modification as atomic and everything surrounding it, especially conditional executions, as not. If you only need a counter, it's great though.
But as we also discussed earlier, I really have no idea what this class is supposed to do and guarantee in terms of thread safety... I personally wouldn't lock on access, and leave that up to the application or a higher level, but my draft was supposed to just be a quick help or illustration of something clearer and simpler to understand.
Also, based on your comments here:
// A special race condition can occur if source_alloc->refcount is 0,
// and it is just about to be deleted. In this case, ref() will fail, and we
// should recover.
This is a good example of why refcount decrement and the free check on zero should be done in the destructor. If the decrement is done in destructor, and refcount reaches zero, there can logically be no other executable code that can assign to that object because there can be no other possible references to it. There's nothing to assign to anymore so refcount on assignment can never be zero.
It occurred to me that for the version where we lock the Alloc
when creating Access
, we can move almost all the functionality of the PoolVector
(push_backs etc) into the Write
, because by definition that will have already done COW and locking. Then the PoolVector
versions can be thin wrappers on the Write
versions.
Then in calling code we should try and encourage client code to use a Write
, and call the functions through that, rather than directly from PoolVector.
This has two effects:
- It means only a single lock and COW is needed (for the
Write
), and the other functions can be used without needing to do this, therefore they can be far more efficient. - Everything from the client code is thus effectively enclosed in a lock (caused by the Write), so the thread safety is greatly increased.
It will still work the old way of course, but there are some performance critical / multithreaded parts we can move over to this paradigm.
Some of the algorithms are now moved out, because LocalVector
already contains code for push_back
, resize
etc and we can just reuse these rather than repeating them in PoolVector
(more possibility for errors etc).
I've also done timings for this version and it is within a smidgeon of the old version, even with the extra safety etc, and is significantly faster for push back:
(figures are comparable to earlier benchmarks)
push_back
benchmark - 1077ms, 1093, 1096, 1105, 1069
set
benchmark - 1143ms, 1160, 1158, 1178, 1141