godot [3.x] Revamp PoolVector

Modify PoolVector internals to use LocalVector internally, and remove the hard limit on the number of PoolVectors.

While in master, PoolVector has been replaced wholesale with Vector, in order to minimize disruption to existing codebase in 3.x, I have been testing making PoolVector a simple wrapper around LocalVector giving it COW and refcounting and compatibility with the existing PoolVector interface.

Fixes #37154 Fixes #53665 Alternative to #54263

Update

When using DEV_ENABLED builds it will now throw a warning when contention is detected. This is a situation which could result in undefined behaviour when compiling without GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC. There are two separate methods for detecting contention depending on whether the PR is compiled using single global lock or lock per Alloc.

Thread safety

Although I originally started this PR in order to simply address the hard limit on the number of allocations, I've also noticed that there is a potential thread safety issue, which could lead to data corruption.

To be clear the original implementation uses a mutex for the copy_on_write and resize, which are the most important points to cover (because the data address may change), however it doesn't currently lock the Alloc for reads and writes.

Simple or pedantic locking

In order to cover more situations and detect whether this was a problem, this PR features two ways of handling thread safety:

Single global mutex used on resize and COW, mirroring the previous approach
Mutex per Alloc, which gives fine grained locking, and locks on all operations, including reserving a Read or a Write. (we could alternatively use a simpler locking primitive)

Both will print warnings (once to avoid spamming) in DEV_ENABLED builds where it detects contention - situations where more than one thread is attempting to read and write the same data at the same time. This has high potential to be a bug. It turns out that this situation is occurring with high frequency when the renderer is set to "multithreaded" mode (not the default), especially when loading scenes.

@jordo has pointed out that to some extent we could say the object T should be held responsible for thread safety within set and get operations, and this may apply for complex types. However many of the PoolVectors are used with simple types such as Vector3 so there is the possibility for mangled results, or more likely, reading an array half in one state and half in another.

Should COW protect against multithread access?

To some extent, it seems that COW should be in a position to provide some protection against multiple threads using the same Alloc. Providing a PoolVector is copied, rather than used via a pointer or by reference, then the new PoolVector should copy the Alloc on write, ensuring that the thread reading will read the same historical data.

This is probably working in a lot of situations, but not all. There appear to be some situations where the contention is happening in practice. There also appear to be a number of cases which may also be susceptible, such as passing data by const reference PoolVectors from the scene tree to the servers.

How to solve

It is difficult to know what the best way to "solve" this problem is. For now we are thinking of having version (1) active (i.e. the same behaviour as before), and see if we can detect contention in DEV_ENABLED builds and flag a warning, without affecting behaviour.

We could alternatively go with (2), however there are additional costs in terms of memory use (at the least 80 bytes per mutex, and possibly more), and potentially in terms of performance (a lot more locking - this hasn't seemed too noticable but I haven't benchmarked), and construction / destruction of the mutexes.

With (2) the paradigm becomes only one threads Read or Write can have access to the Alloc at a time (a single thread can do a double lock, as the mutex is recursive). This could lead to some coarse grained waiting if there is contention across threads, but it should be safe, and locking per Read / Write is potentially a lot more efficient than e.g. locking per element.

Notes

Still a WIP, particularly we need to decide on the locking method.
This version unlike the original PoolVector allocates no memory up front, and has no hard limits to the number of PoolVectors. This means memory RAM use is typically lower than before, and is probably a better approach to the limit than increasing it and increasing the fixed RAM cost.
I first tested this wrapping Vector, however this duplicate the COW functionality in PoolVector, so I'm now using LocalVector, which has no COW functionality, and it seems to be working fine so far.
Any locking / multithread safety is handled by PoolVector and PoolVector::Access, as LocalVector is not thread safe.
Debug memory counts are now implemented as before, using a mutex (there is the possibility to use atomics / SafeNumerics directly for these counters, and I have had this working, but would rather leave to a 2nd PR as the benefits for performance are not yet clear).
If we are happy with this we may decide later to put the PoolVector_Allocs into one of the CommonPools, using e.g. PagedAllocator, rather than make the dynamic allocations using malloc (although of course the vector allocations themselves for resizing will still use malloc).

Mar 23 '22 11:03 lawnjelly

@lawnjelly @reduz can we setup a discussion on contributors chat for this stuff? I feel like I'm really lost with whats going on with this stuff as of late, and I'm having a hard time understanding what changes are being made, what the problems are, what are the design goals, etc. I'm not opposed to any of this, and welcome all and any efforts to improve (there definitely needs to be improvement), but the changes are coming in hot and fast with not much formal discussion or technical proposals of what exactly the issues are, how any PR and/or refactoring directly addresses them, and benchmarks to examine .

The core fundamentals & data structures of the engine are in dire need of additional analysis & scrutiny, unit testing & benchmarks, and technical documentation that can be understood a group of contributors. This is a key area that I have, over the past year, spent too many hours modifying and working around to meet performance requirements of a game where we needed to be performance first.

- Can we define technically, what the design goals of PoolAllocator were, and how it accomplished them? Because this PR removes it completely. From what I can tell on a first glance, it's a pre-allocated block, which is/was used as a generic memory pool with a limited amount of compaction and resizing capabilities per Entry.
- Since this PR maintains the data structure called PoolVector, in 3.X we have Vector, LocalVector, and PoolVector (not that we don't already have them already). But what exactly is PoolVector, and what does it do? Is it documented anywhere? Is it effectively now just a reference counted, COW, LocalVector? Is it even supposed to be thread safe, or not? Does it have stable element references on iteration? etc.

The second comment above is more a general thought about the core structures of the engine, but it illustrates a point that none of this is really well communicated or formally documented anywhere.

As a contrast, we've moved almost all of our performance critical Dictionary & Hashmap operations to https://github.com/martinus/robin-hood-hashing (either it's flat_map, or node based memory backing layouts), and seen absolutely ridiculous performance improvements across the board and almost everywhere. This single data structure's test dir has an incredible number of unit tests and benchmarks (https://github.com/martinus/robin-hood-hashing/tree/master/src/test/unit). I just think it's a great example, that the foundational building blocks of a project as large as godot, especially data structures, need to have proper documentation, tests, and benchmark suites. Otherwise its very difficult to understand what's changing, what the goals are, and try to objectively determine what results in the best outcome performance wise.

With regards to the direct questions and comments that @lawnjelly brought up in this PR, there's a lot. Would we be able to schedule a contributors discussion and perhaps even post a more formal write up with what the problems are (other than just linked issue #'s), and what we are changing and why? I think that would be really helpful.

Mar 27 '22 23:03 jordo

@lawnjelly @reduz can we setup a discussion on contributors chat for this stuff? I feel like I'm really lost with whats going on with this stuff as of late, and I'm having a hard time understanding what changes are being made, what the problems are, what are the design goals, etc.

@jordo I'm in broad agreement with everything you've said there, and first of all "don't panic", everything especially in core tends to go through an extensive period of scrutiny and peer review before we consider merging anything - no one gets to unilaterally make changes on their own, even reduz PRs are open for review.

There should be plenty of time and interest for your input, and it is always encouraged as you are well versed in this area. Indeed as I said on rocket chat the other day you are very encouraged to make PRs or draft PRs / proposals (often we have different ways of approaching a problem, as here we have a linked PR by @Calinou which spurred me to have a look at the problem and experiment with this alternative approach), I think you said you'd be able to do more after a deadline at work.

A meeting would be great actually to discuss some of these core aspects in 3.x and 4.x. Makes it far easier to pin down Juan and get his input as he is usually busy with lots of things. :grinning:

Irrespective of meetings / short discussions on rocket chat, often we do the majority of discussion in comments on PRs too, as this works no matter what time zone you are in etc.

The core fundamentals & data structures of the engine are in dire need of additional analysis & scrutiny, unit testing & benchmarks, and technical documentation that can be understood a group of contributors. This is a key area that I have, over the past year, spent too many hours modifying and working around to meet performance requirements of a game where we needed to be performance first.

Agreed.

Can we define technically, what the design goals of PoolAllocator were, and how it accomplished them? Because this PR removes it completely. From what I can tell on a first glance, it's a pre-allocated block, which is/was used as a generic memory pool with a limited amount of compaction and resizing capabilities per Entry.

It was dead code, as far as I could see. Looked to be used historically from PoolVector, but not currently. So the policy is normally to remove dead code where possible (it remains in the git history so can be salvaged). Juan mentions in the PR that removes PoolVector in 4.x, that a lot of this may be historical stuff from when machines had very limited memory.

Since this PR maintains the data structure called PoolVector, in 3.X we have Vector, LocalVector, and PoolVector (not that we don't already have them already). But what exactly is PoolVector, and what does it do? Is it documented anywhere? Is it effectively now just a reference counted, COW, LocalVector? Is it even supposed to be thread safe, or not? Does it have stable element references on iteration? etc.

Yes again, this is lack of documentation, this should imo be at the top of the header file a description of what the template is, what it is meant to do etc. By reverse engineering I worked out it is meant to be a COW reference-counted vector. The multithread paradigm was totally unclear, but empirical testing for simultaneous access suggests to me it is being used from multiple threads, and it does not look currently thread safe to do so. Whether that constitutes a problem in the client code, or a problem in the template itself, I cannot say.

I just think it's a great example, that the foundational building blocks of a project as large as godot, especially data structures, need to have proper documentation, tests, and benchmark suites. Otherwise its very difficult to understand what's changing, what the goals are, and try to objectively determine what results in the best outcome performance wise.

Do agree, but to be pragmatic also - large engines like Godot can end up being the result of a large amount of historical stuff. Hence there is a lot of stuff using linked lists which should really be using vectors, vectors which could be local vectors etc etc. This was part of the impetus for the large changes in 4.x I believe.

Especially in 3.x (which this PR is aimed at), it is a rather special situation - we have to work with what we have, there isn't a huge scope for change (as we need backward compatibility, and we don't want to rewrite the rest of the codebase).

So in 3.x the limited core changes that we might make are likely to be:

Those that are necessary to fix bugs
Those that can relatively simply make current systems run more efficiently

Thus for 3.x there seem to be two bugs in the existing PoolVector, which would be nice to close:

Hard limit of 65535 (and associated fixed memory use scaling with this limit).
Possibly multithreading bug, either in use or within the template or both.

Mar 28 '22 09:03 lawnjelly

Just to note briefly discussed this this morning:

lawnjelly 9:43 AM reduz I've been having a bit of a look at PoolVector in 3.x. I realised you have changed it mostly to use Vector in 4.x, so I have been experimenting with changing the internals of PoolVector in 3.x to make it a wrapper but still maintain the same API, primarily in order to get around the limit of 65535 allocations. This has been going really well, and I have a draft PR: #59444

However what has foxed me is that while I can see PoolVector should be reference counted and COW, I'm unsure of whether it should be supporting multithreading. There is some effort at locks in the existing code, but it doesn't provide very much safety. And when I put some debugging in, it seems that in some builds of the engine (especially multithreaded renderer), and in some situations (loading?) PoolVectors do seem to be being used simultaneously from multiple threads. So what I'm wondering is, in your view, would it be better to tighten up PoolVector to make it as thread safe as possible, or to examine the calling code and try and prevent these multithread accesses?

reduz 📛 reduz 📛 9:49 AM For Godot 3 I would not worry a lot, you can just use Vector behind the back

lawnjelly lawnjelly 9:49 AM I must admit I'm slightly concerned that other than read situations, there is a great potential for things to go horribly wrong with multithread use.

reduz 📛 reduz 📛 9:49 AM i dont think this is used in multithreading much in G3

So this confirms that at least Juan isn't aware of any pre-existing scheme to handle multithread access here, and we should try and address it as best we can (given the practicalities and the desire to not change too much).

I will try and investigate the multithread access a bit more. There is always the option to do nothing to improve multithreading over the old implementation, but if there is some low hanging fruit here in terms of improvements it may be worth taking advantage (providing it doesn't compromise performance - some locking can be essentially free if it is rarely contended).

Mar 31 '22 09:03 lawnjelly

@lawnjelly can you provide a example where multithreaded access and/or write causes invalid access? This could also be checked empirically with tsan and a heavy-load test suite. But what critical sections do you want to lock/unlock on? There already looks to be many places in CowData<T> that look like it attempts to be thread-safe, but it's absolutely not. So the consideration of thread safety at the higher PoolVector layer maybe should just be punted on until some of the foundational structures are tightened up? Example:

https://github.com/godotengine/godot/blob/3.x/core/cowdata.h#L227-L228

This above is a good example of thread safety done incorrectly. _copy_on_write() is supposed to be thread safe as it trys to use a SafeNumeric, but you can obviously see in the above snippet, if execution is preempted after executing line 227 (loading rc to register), and another thread just so happens to increment the refcount from 1 to > 1 in between those lines, we end up with issues and problems.

The fact that this works most of the time and doesn't crash and do unexpected things randomly I'm guessing is just due to the fact that these critical sections aren't hit or hammered hard enough with multi-threaded load. Perhaps not many developers or projects using these data structures in such environments.

Another example in cowdata.h:

https://github.com/godotengine/godot/blob/3.x/core/cowdata.h#L200

The above leads to a double-free bug/exploit. Line 200 is not atomic, as the decrement + conditional evaluation is not an atomic operattion despite being a single line. This leads to double-free bug.

Mar 31 '22 16:03 jordo

Just to note we discussed this yesterday on rocketchat:

This PR is using LocalVector rather than Vector. LocalVector has no COW or refcounting, and PoolVector uses a different COW solution. @jordo 's comments about CowData don't apply here (although is still relevant for Vector of course), although the COW solution in this PR does need to be independently checked for thread safety.

Contention

Today I'll be trying a few projects and trying to ascertain an idea of how often contention happens in PoolVector, using a mutex per Alloc. I've done this by uncommenting the section in the PR in Alloc::lock(), where it can optionally call try_lock. If the try_lock fails, I print a WARNING and hit a breakpoint.

I try to use some real game projects rather than demos where possible, as these can better represent real world use cases. These are often closed source.

So far it seems that contention seems more a problem when multithread rendering is selected in the project setting, rather than the default single-safe. It also doesn't seem to be occurring so far during using the editor, but does during loading the games.

3D Platformer Demo

No evidence of contention.

Beat Invaders (@RPicster )

No contention in single-safe renderer. With multithread renderer there is a lot of contention. Not knowing the inner workings of the game I don't know whether there is some threading in the script, but I will show some stack traces here.

This seems to all be occurring in the initial loading phase of the game, when playing the game itself there is no contention so far.

texture_set_data

1  PoolVector<unsigned char>::Alloc::lock                                                                                                                                                                pool_vector.h                145 0x2bbca60      
2  PoolVector<unsigned char>::Access::_ref                                                                                                                                                               pool_vector.h                304 0x2bbc9e7      
3  PoolVector<unsigned char>::read                                                                                                                                                                       pool_vector.h                391 0x2bba7ad      
4  RasterizerStorageGLES3::texture_set_data                                                                                                                                                              rasterizer_storage_gles3.cpp 756 0x40f3116      
5  VisualServerRaster::texture_set_data                                                                                                                                                                  visual_server_raster.h       155 0x5ffbad2      
6  CommandQueueMT::Command3<VisualServer, void (VisualServer:: *)(RID, Ref<Image> const&, int), RID, Ref<Image>, int>::call                                                                              command_queue_mt.h           300 0x6086605      
7  CommandQueueMT::flush_one                                                                                                                                                                             command_queue_mt.h           441 0x5f7487a      
8  CommandQueueMT::wait_and_flush_one                                                                                                                                                                    command_queue_mt.h           478 0x5f6d287      
9  VisualServerWrapMT::thread_loop                                                                                                                                                                       visual_server_wrap_mt.cpp    78  0x6069bc8      
10 VisualServerWrapMT::_thread_callback                                                                                                                                                                  visual_server_wrap_mt.cpp    64  0x6069aed      
11 Thread::callback                                                                                                                                                                                      thread.cpp                   78  0x63ccbe7      
12 std::__invoke_impl<void, void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>                                              invoke.h                     60  0x63cdbea      
13 std::__invoke<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>                                                         invoke.h                     95  0x63cda61      
14 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul> thread                       264 0x63cd9e0      
15 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::operator()                         thread                       271 0x63cd945      
16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>>::_M_run   thread                       215 0x63cd59e      
17 execute_native_thread_routine                                                                                                                                                                                                          0x68970c4      
18 start_thread                                                                                                                                                                                          pthread_create.c             477 0x7ffff7b74609 
19 clone                                                                                                                                                                                                 clone.S                      95  0x7ffff7944163

regenerate_word_cache

1   PoolVector<unsigned char>::Alloc::lock       pool_vector.h    145  0x2bbca60 
2   PoolVector<unsigned char>::Alloc::super_lock pool_vector.h    127  0x2c0da67 
3   PoolVector<unsigned char>::_copy_on_write    pool_vector.h    194  0x2c0d938 
4   PoolVector<unsigned char>::write             pool_vector.h    398  0x2c0db56 
5   DynamicFontAtSize::_bitmap_to_character      dynamic_font.cpp 534  0x59377f3 
6   DynamicFontAtSize::_update_char              dynamic_font.cpp 676  0x593ef21 
7   DynamicFontAtSize::get_char_size             dynamic_font.cpp 289  0x5936ccd 
8   DynamicFont::get_char_size                   dynamic_font.cpp 908  0x5939b7f 
9   Label::regenerate_word_cache                 label.cpp        468  0x5222777 
10  Label::get_minimum_size                      label.cpp        292  0x5222d78 
11  Control::_update_minimum_size_cache          control.cpp      184  0x5159c63 
12  Control::get_combined_minimum_size           control.cpp      203  0x5159d7a 
13  Control::_size_changed                       control.cpp      1194 0x5158dc5 
14  Control::_notification                       control.cpp      481  0x515e87f 
15  Control::_notificationv                      control.h        47   0x2c2462b 
16  Label::_notificationv                        label.h          37   0x510441f 
17  Object::notification                         object.cpp       927  0x624c35b 
18  Node::_propagate_ready                       node.cpp         175  0x5079625 
19  Node::_propagate_ready                       node.cpp         171  0x50795f2 
20  Node::_propagate_ready                       node.cpp         171  0x50795f2 
... <More>

draw_char

1   PoolVector<unsigned char>::Alloc::lock       pool_vector.h       145  0x2bbca60 
2   PoolVector<unsigned char>::Alloc::super_lock pool_vector.h       127  0x2c0da67 
3   PoolVector<unsigned char>::_copy_on_write    pool_vector.h       194  0x2c0d938 
4   PoolVector<unsigned char>::write             pool_vector.h       398  0x2c0db56 
5   DynamicFontAtSize::_bitmap_to_character      dynamic_font.cpp    534  0x59377f3 
6   DynamicFontAtSize::_make_outline_char        dynamic_font.cpp    626  0x5938abe 
7   DynamicFontAtSize::_update_char              dynamic_font.cpp    672  0x593edea 
8   DynamicFontAtSize::draw_char                 dynamic_font.cpp    352  0x5937063 
9   DynamicFont::draw_char                       dynamic_font.cpp    957  0x593a003 
10  FontDrawer::draw_char                        font.h              94   0x5212444 
11  Label::_notification                         label.cpp           266  0x5221a40 
12  Label::_notificationv                        label.h             37   0x510447b 
13  Object::notification                         object.cpp          927  0x624c35b 
14  CanvasItem::_update_callback                 canvas_item.cpp     455  0x5653908 
15  MethodBind0::call                            method_bind.gen.inc 59   0x2e1702c 
16  Object::call                                 object.cpp          918  0x624fa9c 
17  MessageQueue::_call_function                 message_queue.cpp   241  0x6243849 
18  MessageQueue::flush                          message_queue.cpp   284  0x6243b4f 
19  SceneTree::iteration                         scene_tree.cpp      563  0x50b3dd2 
20  Main::iteration                              main.cpp            2242 0x2bfd33f 
... <More>

_update_minimum_size_cache

1   PoolVector<unsigned char>::Alloc::lock       pool_vector.h    145  0x2bbca60 
2   PoolVector<unsigned char>::Alloc::super_lock pool_vector.h    127  0x2c0da67 
3   PoolVector<unsigned char>::_copy_on_write    pool_vector.h    194  0x2c0d938 
4   PoolVector<unsigned char>::write             pool_vector.h    398  0x2c0db56 
5   DynamicFontAtSize::_bitmap_to_character      dynamic_font.cpp 534  0x59377f3 
6   DynamicFontAtSize::_update_char              dynamic_font.cpp 676  0x593ef21 
7   DynamicFontAtSize::get_char_size             dynamic_font.cpp 289  0x5936ccd 
8   DynamicFont::get_char_size                   dynamic_font.cpp 908  0x5939b7f 
9   Label::get_longest_line_width                label.cpp        325  0x522309d 
10  Label::regenerate_word_cache                 label.cpp        377  0x5221e0f 
11  Label::get_minimum_size                      label.cpp        292  0x5222d78 
12  Control::_update_minimum_size_cache          control.cpp      184  0x5159c63 
13  Control::get_combined_minimum_size           control.cpp      203  0x5159d7a 
14  Control::_size_changed                       control.cpp      1194 0x5158dc5 
15  Control::_notification                       control.cpp      481  0x515e87f 
16  Control::_notificationv                      control.h        47   0x2c2462b 
17  Label::_notificationv                        label.h          37   0x510441f 
18  Object::notification                         object.cpp       927  0x624c35b 
19  Node::_propagate_ready                       node.cpp         175  0x5079625 
20  Node::_propagate_ready                       node.cpp         171  0x50795f2 
... <More>

get_char_size

1   PoolVector<unsigned char>::Alloc::lock       pool_vector.h    145  0x2bbca60 
2   PoolVector<unsigned char>::Alloc::super_lock pool_vector.h    127  0x2c0da67 
3   PoolVector<unsigned char>::_copy_on_write    pool_vector.h    194  0x2c0d938 
4   PoolVector<unsigned char>::write             pool_vector.h    398  0x2c0db56 
5   DynamicFontAtSize::_bitmap_to_character      dynamic_font.cpp 534  0x59377f3 
6   DynamicFontAtSize::_update_char              dynamic_font.cpp 676  0x593ef21 
7   DynamicFontAtSize::get_char_size             dynamic_font.cpp 289  0x5936ccd 
8   DynamicFont::get_char_size                   dynamic_font.cpp 908  0x5939b7f 
9   Font::get_string_size                        font.cpp         477  0x596f618 
10  Button::get_minimum_size                     button.cpp       37   0x512f6e0 
11  Control::_update_minimum_size_cache          control.cpp      184  0x5159c63 
12  Control::get_combined_minimum_size           control.cpp      203  0x5159d7a 
13  Control::_size_changed                       control.cpp      1194 0x5158dc5 
14  Control::_notification                       control.cpp      481  0x515e87f 
15  Control::_notificationv                      control.h        47   0x2c2462b 
16  BaseButton::_notificationv                   base_button.h    39   0x4687b0f 
17  Button::_notificationv                       button.h         37   0x46879ff 
18  Object::notification                         object.cpp       927  0x624c35b 
19  Node::_propagate_ready                       node.cpp         175  0x5079625 
20  Node::_propagate_ready                       node.cpp         171  0x50795f2 
... <More>

Blubits Demo

No contention in single threaded renderer. Multithreaded renderer, again during loading the scene:

texture_set_data

1  PoolVector<unsigned char>::Alloc::lock                                                                                                                                                                pool_vector.h                145 0x2bbca60      
2  PoolVector<unsigned char>::Access::_ref                                                                                                                                                               pool_vector.h                304 0x2bbc9e7      
3  PoolVector<unsigned char>::read                                                                                                                                                                       pool_vector.h                391 0x2bba7ad      
4  RasterizerStorageGLES3::texture_set_data                                                                                                                                                              rasterizer_storage_gles3.cpp 756 0x40f3116      
5  VisualServerRaster::texture_set_data                                                                                                                                                                  visual_server_raster.h       155 0x5ffbad2      
6  CommandQueueMT::Command3<VisualServer, void (VisualServer:: *)(RID, Ref<Image> const&, int), RID, Ref<Image>, int>::call                                                                              command_queue_mt.h           300 0x6086605      
7  CommandQueueMT::flush_one                                                                                                                                                                             command_queue_mt.h           441 0x5f7487a      
8  CommandQueueMT::wait_and_flush_one                                                                                                                                                                    command_queue_mt.h           478 0x5f6d287      
9  VisualServerWrapMT::thread_loop                                                                                                                                                                       visual_server_wrap_mt.cpp    78  0x6069bc8      
10 VisualServerWrapMT::_thread_callback                                                                                                                                                                  visual_server_wrap_mt.cpp    64  0x6069aed      
11 Thread::callback                                                                                                                                                                                      thread.cpp                   78  0x63ccbe7      
12 std::__invoke_impl<void, void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>                                              invoke.h                     60  0x63cdbea      
13 std::__invoke<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>                                                         invoke.h                     95  0x63cda61      
14 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul> thread                       264 0x63cd9e0      
15 std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>::operator()                         thread                       271 0x63cd945      
16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void ( *)(Thread *, Thread::Settings const&, void ( *)(void *), void *), Thread *, Thread::Settings, void ( *)(void *), void *>>>::_M_run   thread                       215 0x63cd59e      
17 execute_native_thread_routine                                                                                                                                                                                                          0x68970c4      
18 start_thread                                                                                                                                                                                          pthread_create.c             477 0x7ffff7b74609 
19 clone                                                                                                                                                                                                 clone.S                      95  0x7ffff7944163

regenerate_word_cache

1   PoolVector<unsigned char>::Alloc::lock       pool_vector.h       145  0x2bbca60 
2   PoolVector<unsigned char>::Alloc::super_lock pool_vector.h       127  0x2c0da67 
3   PoolVector<unsigned char>::_copy_on_write    pool_vector.h       194  0x2c0d938 
4   PoolVector<unsigned char>::write             pool_vector.h       398  0x2c0db56 
5   DynamicFontAtSize::_bitmap_to_character      dynamic_font.cpp    534  0x59377f3 
6   DynamicFontAtSize::_update_char              dynamic_font.cpp    676  0x593ef21 
7   DynamicFontAtSize::get_char_size             dynamic_font.cpp    289  0x5936ccd 
8   DynamicFont::get_char_size                   dynamic_font.cpp    908  0x5939b7f 
9   Label::get_longest_line_width                label.cpp           325  0x522309d 
10  Label::regenerate_word_cache                 label.cpp           377  0x5221e0f 
11  Label::get_total_character_count             label.cpp           631  0x522388a 
12  Label::set_percent_visible                   label.cpp           599  0x5223a58 
13  MethodBind1<float>::call                     method_bind.gen.inc 775  0x2e18d42 
14  ClassDB::set_property                        class_db.cpp        1007 0x6193eda 
15  Object::set                                  object.cpp          422  0x624c49f 
16  SceneState::instance                         packed_scene.cpp    272  0x5a3498f 
17  PackedScene::instance                        packed_scene.cpp    1588 0x5a35a57 
18  SceneState::instance                         packed_scene.cpp    163  0x5a336e8 
19  PackedScene::instance                        packed_scene.cpp    1588 0x5a35a57 
20  SceneTree::change_scene_to                   scene_tree.cpp      1366 0x50b8fbf 
... <More>

Jetpaca

No contention in single-safe renderer. With multithread renderer: texture_set_data

Discussion

The contention only seems to be coming from a few places, perhaps this could be used to our advantage. Or maybe we could restrict locking to only happen in multithread renderer mode, as it only seems to be occurring there. It is possible that we haven't had issues with this before simply because few people are using the multithreaded renderer option (I'm not sure it offers significant benefits in terms of performance anyway).

We could also enable some debug logging to flag such contention in DEV_ENABLED builds... :thinking:

Apr 01 '22 08:04 lawnjelly

Some performance timings of different release compiles of PoolVector, in milliseconds, smaller times is therefore faster.

These are in gdscript so not super gold standard, because there will be a large fixed cost due to the scripting, but easier to rig up, and give us a ballpark idea of whether various approaches are faster or slower.

Here's the project: PoolVectorBenchmark.zip

Timings made on Intel Core i5 7500T 2.7ghz x 4, Linux Mint 20.

1024x1024 push backs, 10 runs (smaller is faster)

Old PoolVector : 1899, 1826, 1872, 1808 This PR, global lock : 1431, 1393, 1402, 1378 This PR, lock per alloc : 1579, 1583, 1639, 1593

1024x1024 sets, 10 runs (smaller is faster)

Old PoolVector : 1116, 1099, 1072, 1078 This PR, global lock : 1107, 1057, 1053, 1065 This PR, lock per alloc : 1173, 1185, 1239, 1198

Results

push_back is very inefficient, and stresses the resize machinery, and requires a lock in both the old and new PoolVector implementations. set is lock free except in the case of the "lock per alloc" version.

The global lock approach was expected to be roughly similar to the old PoolVector. As it happens, it turns out to be a bit faster for the push_backs, and not much difference for the sets.

The "lock per alloc" approach was expected to be slower during the sets as this stresses the locking. This was the case and it is a smidgen slower than before, in exchange for the safety. For the push_backs, it is faster than the old PoolVector, but not by as much as the global lock version.

Discussion

The main thing is that this confirms that the PR with global lock is not slower than the old implementation, and together with fixing the 65535 limitation this makes a fairly compelling case for introducing it.

The lock per alloc turns out to not be prohibitively expensive either, so we can evaluate at a later date either moving to this version generally, or having this version run when the engine is in multithreaded mode.

Apr 04 '22 10:04 lawnjelly

I guess an over-simplication could just be, is PoolVector thread-safe or not? It seems to attempt to be thread-safe as PoolVector<T>::operator[] uses 'read' access, PoolVector<T>::set uses a 'write' access, etc. If that is indeed the case, it would just be so much simpler to put a mutex on the object, and a std::lock_guard at the start of each api call that PoolVector exposes, which is pretty thin tbh. And most of the interface into actually using a PoolVector is through read & write access, which is partially doing that anyways right now.

Note that the actual data for PoolVector is not stored in the PoolVector (or owned directly), it is stored in the PoolVector::Alloc, which is a reference counted object.

We could put a lock on PoolVector functions, but it wouldn't help, because multiple PoolVectors can be accessing the same PoolVector::Alloc simultaneously. Thus it is the Alloc that needs to be the thing that is locked, either with the global lock or lock per alloc.

The Alloc can be optionally locked in this PR on all accesses (which I think is what you are getting at), that is what the GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC define is intended to offer.

Apr 05 '22 06:04 lawnjelly

The Alloc is already locked on all accesses, if you take a look at Access, which is inherited by Read and Write in pool_vector.h. What that basically means is that when there is either a read or write access object, the calling thread will straight up try and lock access anyways. So the thread safety of the class is totally bonkers right now and already quite half baked.

I put up a simple draft PR with here if of any interest. It probalby won't go anywhere,but it's quite clear and concise what interface does what wrt thread safety:

https://github.com/godotengine/godot/pull/59903

Also made a few quick comments on rocket chat as well. But i think the class, as is right now, is unnecessarily complicated (as sometimes evidenced by amount of comments).

I don't really have a strong preference though about merging this PR however, I just hope at some point the godot project can start to focus on refactoring, unit_testing the container classes, improving clarity, & cleanup etc. These classes shouldn't be that hard to follow. But I can't use the class in a contentious multi-threaded environment anyways, as currently too much potential for errors.

Apr 05 '22 06:04 jordo

The Alloc is already locked on all accesses, if you take a look at Access, which is inherited by Read and Write in pool_vector.h. What that basically means is that when there is either a read or write access object, the calling thread will straight up try and lock access anyways. So the thread safety of the class is totally bonkers right now and already quite half baked.

You mean in this PR with the lock per alloc? This is the same as your PR does I think #59903 .

I've just seen that on closer examination the original does increment a lock SafeNumeric: https://github.com/godotengine/godot/blob/a4bbc87acc7a22f273452301da2d70ebf0e65914/core/pool_vector.h#L257-L268

Which I thought originally had no effects, but it turns out it is used to abort a resize if set: https://github.com/godotengine/godot/blob/a4bbc87acc7a22f273452301da2d70ebf0e65914/core/pool_vector.h#L527-L529

This probably means we should either go for the fine grained lock, or put a similar SafeNumeric in coarse grained mode to emulate this behaviour (in which case it wouldn't be that much more expensive than to just use the mutex and be done with it). I'm pretty sure things would go horribly wrong shortly after a resize gets aborted.

Anyway I'm putting in an equivalent SafeNumeric lock for the coarse grained mode and we'll see where we are, with this new info I'm inclined to also move towards the plan of just using a mutex per Alloc.

I put up a simple draft PR with here if of any interest. It probably won't go anywhere,but it's quite clear and concise what interface does what wrt thread safety:

I've been having a look, it seems to mostly do the same thing, it's more like this PR with GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC set. When we spoke yesterday you seemed against locking for access, and preferred just for COW and resize and major stuff (which is what this PR does without GODOT_POOL_VECTOR_USE_LOCK_PER_ALLOC defined), but you seem to have changed opinion since then.

But i think the class, as is right now, is unnecessarily complicated (as sometimes evidenced by amount of comments).

I do totally understand especially as it is currently covering two cases in one. If we decide in e.g. a PR meeting to go with one approach we can easily remove the alternative path. It may be once we have decided on a method e.g. the mutex per alloc that this looks very similar to your version. I may have a look at splitting this up into two PRs as I do get the point it is rather hard to follow.

It also may be the PR covers some special cases which may not be dealt with in your current version (but maybe the same is true vice versa, I am copying some stuff lol), as I remember there being at least a couple of "gotchas" to watch out for. Changing to locked guards is a good option, we use this technique in some other parts of Godot, I don't know if there is currently a standard form for this. I'm really following the lead of the original PoolVector here using explicit locks.

I don't really have a strong preference though about merging this PR however, I just hope at some point the godot project can start to focus on refactoring, unit_testing the container classes, improving clarity, & cleanup etc.

Yes this is the aim for 4.x, for 3.x we are aiming not to change too much - the vast majority of effort is going into 4 at this point.

But I can't use the class in a contentious multi-threaded environment anyways, as currently too much potential for errors.

Absolutely, this is more of a band-aid - I can also see lots of potential failure points. But at this point we can't afford to rewrite 3.x to fit a new paradigm. The interface to PoolVector ideally has to stay the same, we have to weigh up the cost of introducing a whole host of new bugs if we rewrote client code "properly". It isn't ideal, but that is the situation we have to deal with I guess.

UPDATE: I have decided splitting up into two PRs will be better, one for lock per alloc, and one more similar to the old version. Personally I think the slight performance cost is worth the extra safety of using a mutex per Alloc, and if necessary the calling code can be tidied up in bottlenecks. Version is still very much WIP though, there's probably some unsafe stuff in there as I've rejigged a fair bit, and I will be away tomorrow so will do a bit more later this week. I will borrow liberally from yours @jordo I'm sure as you are considerably more well versed in this area than myself.

Apr 05 '22 09:04 lawnjelly

If you are going to use some of that draft https://github.com/godotengine/godot/pull/59903 I can make a few quick comments to help you through my thoughts there.

To answer a question you asked me on rocket chat about why I removed the reference counter from being a SafeNumeric to a plain uint32_t, it was intentional. SafeNumeric modifications are atomic across threads (increment, drecement, etc), but modification, followed by a load and conditional branch are not thread safe. Preemption can happen after the load and before the conditional JNE or whatever branch, depending on compiler which may makes stuff like below not thread safe

if (alloc->refcount.get() == 1) {
    return;

I see currently you lock around the cow code above, but it's just an example that sometimes using SafeNumeric gives a false sense of thread safety, when the only thing it guarantees is an atomic modifier and nothing more really should be assumed, unless you specifically operate with load modes and start getting into things like atomic compare-and-swap loops. But the simplest view is to treat the modification as atomic and everything surrounding it, especially conditional executions, as not. If you only need a counter, it's great though.

But as we also discussed earlier, I really have no idea what this class is supposed to do and guarantee in terms of thread safety... I personally wouldn't lock on access, and leave that up to the application or a higher level, but my draft was supposed to just be a quick help or illustration of something clearer and simpler to understand.

Also, based on your comments here:

		// A special race condition can occur if source_alloc->refcount is 0,
		// and it is just about to be deleted. In this case, ref() will fail, and we
		// should recover.

This is a good example of why refcount decrement and the free check on zero should be done in the destructor. If the decrement is done in destructor, and refcount reaches zero, there can logically be no other executable code that can assign to that object because there can be no other possible references to it. There's nothing to assign to anymore so refcount on assignment can never be zero.

Apr 05 '22 23:04 jordo

It occurred to me that for the version where we lock the Alloc when creating Access, we can move almost all the functionality of the PoolVector (push_backs etc) into the Write, because by definition that will have already done COW and locking. Then the PoolVector versions can be thin wrappers on the Write versions.

Then in calling code we should try and encourage client code to use a Write, and call the functions through that, rather than directly from PoolVector.

This has two effects:

It means only a single lock and COW is needed (for the Write), and the other functions can be used without needing to do this, therefore they can be far more efficient.
Everything from the client code is thus effectively enclosed in a lock (caused by the Write), so the thread safety is greatly increased.

It will still work the old way of course, but there are some performance critical / multithreaded parts we can move over to this paradigm.

Some of the algorithms are now moved out, because LocalVector already contains code for push_back, resize etc and we can just reuse these rather than repeating them in PoolVector (more possibility for errors etc).

I've also done timings for this version and it is within a smidgeon of the old version, even with the extra safety etc, and is significantly faster for push back:

(figures are comparable to earlier benchmarks) push_back benchmark - 1077ms, 1093, 1096, 1105, 1069 set benchmark - 1143ms, 1160, 1158, 1178, 1141

Apr 19 '22 11:04 lawnjelly

godot godot copied to clipboard

[3.x] Revamp PoolVector

Update

Thread safety

Simple or pedantic locking

Should COW protect against multithread access?

How to solve

Notes

Contention

3D Platformer Demo

Beat Invaders (@RPicster )

Blubits Demo

Jetpaca

Discussion

1024x1024 push backs, 10 runs (smaller is faster)

1024x1024 sets, 10 runs (smaller is faster)

Results

Discussion

godot
godot copied to clipboard